This article demonstrates one possible method of merging N large sorted text files using the Forecasting algorithm, in the style suggested by D. Knuth. In the accompanying test suite, the demo application “for_merge” merges the files faster than the standard Unix “sort” utility. The code could also be useful for implementing the second part of the Merge-Sort algorithm, where the merging takes place. Note that the input files are generated before the test suite starts.
- 1. Introduction
- 2. Implementation details
- 2.1 Working environment
- 2.2 Source code overview
- 2.3 Illustration of library usage with an example
- 3. Launching
- 3.1 Command line parameters
- 3.2 Examples
- 4. Testing
- 4.1 Generating input data files
- 4.2 Launching test suite
- 5. Results of testing
- 5.1 SMALL data set
- 5.2 BIG data set
- 5.3 Conclusion
- 6. TODO
1. Introduction
External merging is a term for a class of merging algorithms that can handle massive amounts of data. External merging is required when the data being merged do not fit into the main memory of a computing device (usually RAM) and must instead reside in slower external memory (usually a hard drive).
P-way merge is a more general algorithm that merges P pre-sorted lists into one.
The Forecasting merge algorithm (P-way merge using 2P + 2 buffers) was suggested by D. Knuth in “The Art of Computer Programming”, volume 3, section 5.4.6 (Algorithm F). It keeps track of the buffer that will be emptied first and uses an extra buffer to read the appropriate next part from the disk while the contents of the remaining buffers are processed.
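The forecasting step itself can be sketched in a few lines. The buffer whose last line is the smallest is the one the merge will exhaust first, because every line in it sorts before the last line of any other buffer. The helper below is a hypothetical illustration, not code from for_merge; its name and its use of strcmp() are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not taken from for_merge).
 * last_line[i] is the LAST line currently held in input buffer i.
 * Returns the index of the buffer that will be emptied first, i.e. the
 * one whose last line is smallest. Ties go to the lowest index. */
size_t forecast_next_buffer(const char *const last_line[], size_t p)
{
    assert(p > 0);
    size_t best = 0;
    for (size_t i = 1; i < p; i++) {
        if (strcmp(last_line[i], last_line[best]) < 0)
            best = i;
    }
    return best;
}
```

The file that feeds the returned buffer is the one whose next block should be read into the spare buffer.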
2. Implementation details
2.1 Working environment
To compile the code and run the tests, you need a Linux development environment that includes:
- Linux OS on a PC platform with a multiprocessor system;
- GNU toolchain with gcc and the pthread library for building the C sources.
2.2 Source code overview
The size of available memory is obtained by the get_memory_available() function, which takes into account both 1) system and 2) RLIMIT restrictions.
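As a rough sketch of how the two restrictions might be combined, the function below takes the smaller of the free physical memory and the process address-space limit. It is not the actual get_memory_available(); the function name, the choice of RLIMIT_AS (rather than, say, RLIMIT_DATA) and the use of the glibc extension _SC_AVPHYS_PAGES are all assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/resource.h>
#include <unistd.h>

/* Sketch only; the real get_memory_available() may work differently.
 * Returns an estimate, in bytes, of the memory the process may use. */
uint64_t memory_available_sketch(void)
{
    uint64_t avail = 0;

    /* 1) System restriction: currently free physical pages
     *    (_SC_AVPHYS_PAGES is a glibc extension available on Linux). */
    long pages = sysconf(_SC_AVPHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages > 0 && page_size > 0)
        avail = (uint64_t)pages * (uint64_t)page_size;

    /* 2) RLIMIT restriction: assumed here to be the soft RLIMIT_AS limit. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0 &&
        rl.rlim_cur != RLIM_INFINITY &&
        (uint64_t)rl.rlim_cur < avail)
        avail = (uint64_t)rl.rlim_cur;

    return avail;
}
```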
The size of the available memory is used to set the optimal size of the merge buffers: P input buffers and 1 output buffer. The buffer sizes can also be set through program parameters.
The fill_input_buffer_thread() function fills the forecast buffer in the background.
The open_input_files() function opens the input files for reading and the output file for writing.
The main function, merge_fps(), manipulates the buffers in such a manner that as soon as the end of an input buffer is reached, the “forecasted buffer” is ready to use, so there is no time lag while the next buffer is prepared.
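The selection step at the heart of a P-way merge can be illustrated in memory, leaving the buffer management aside. The sketch below is not merge_fps(): the struct run type and the merge_runs() function are hypothetical, and a linear scan over the P heads stands in for whatever selection method the real code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: one sorted input, held as an array of lines. */
struct run {
    const char *const *lines;
    size_t count;   /* number of lines in the run */
    size_t pos;     /* index of the next unread line */
};

/* Merge p sorted runs into out[], which must have room for all lines.
 * Returns the number of lines written. */
size_t merge_runs(struct run runs[], size_t p, const char *out[])
{
    assert(runs != NULL && out != NULL);
    size_t written = 0;
    for (;;) {
        size_t best = p;                      /* p means "nothing left" */
        for (size_t i = 0; i < p; i++) {
            if (runs[i].pos == runs[i].count)
                continue;                     /* this run is exhausted */
            if (best == p ||
                strcmp(runs[i].lines[runs[i].pos],
                       runs[best].lines[runs[best].pos]) < 0)
                best = i;
        }
        if (best == p)
            break;                            /* every run is exhausted */
        out[written++] = runs[best].lines[runs[best].pos++];
    }
    return written;
}
```

In the external version, "this run is exhausted" becomes "this input buffer is empty", which is the moment the forecast buffer is swapped in.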
The compare() function is rather standard: it compares two lines A and B, returning a negative, zero, or positive value depending on whether A is less than, equal to, or greater than B.
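A comparison with this contract might look like the sketch below. It is not the actual compare(): the name compare_lines(), the explicit delimiter parameter (newline by default, zero byte with -z) and the plain byte-wise ordering are assumptions made for illustration.

```c
#include <assert.h>

/* Sketch of a line comparison, assuming byte-wise (C locale) ordering.
 * Each line must end with `delim`; the delimiter itself is not compared.
 * Returns <0, 0 or >0 if a sorts before, equal to, or after b. */
int compare_lines(const char *a, const char *b, char delim)
{
    assert(a && b);
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    const unsigned char d = (unsigned char)delim;

    while (*pa != d && *pb != d) {
        if (*pa != *pb)
            return (int)*pa - (int)*pb;
        pa++;
        pb++;
    }
    /* At least one line has ended; the shorter line sorts first. */
    if (*pa == d && *pb == d)
        return 0;
    return (*pa == d) ? -1 : 1;
}
```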
The current implementation uses 2 threads. The first one reads data from disk and prepares the input buffer. The second one compares lines from the input buffers and writes data to the output buffer/disk.
It doesn’t use a third thread for writing data from the output buffer to disk because the gain is very small: tests show a performance increase of only about 2%. The reasons are the very large caches of modern disks and efficient synchronous read/write operations.
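The two-thread arrangement can be sketched with a single background buffer handed over through a mutex and a condition variable. This is a simplified illustration, not the for_merge code: struct forecast, fill_thread() and copy_with_forecast() are hypothetical names, the block size is arbitrary, and the consumer simply copies data instead of merging it. Build with gcc -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 4096   /* arbitrary block size for the sketch */

struct forecast {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    FILE  *in;            /* file read in the background */
    char   data[BLOCK];   /* the buffer being prepared */
    size_t len;           /* valid bytes in data; 0 means end of file */
    int    ready;         /* 1 while data holds an unconsumed block */
};

/* Reader thread: refills the buffer whenever the consumer has taken it. */
static void *fill_thread(void *arg)
{
    struct forecast *f = arg;
    size_t n;
    do {
        pthread_mutex_lock(&f->lock);
        while (f->ready)
            pthread_cond_wait(&f->changed, &f->lock);
        pthread_mutex_unlock(&f->lock);

        /* ready == 0, so the consumer does not touch data right now. */
        n = fread(f->data, 1, BLOCK, f->in);

        pthread_mutex_lock(&f->lock);
        f->len = n;
        f->ready = 1;
        pthread_cond_signal(&f->changed);
        pthread_mutex_unlock(&f->lock);
    } while (n > 0);
    return NULL;
}

/* Consumer: takes each prepared block and writes it while the reader
 * is already fetching the next one. Returns the bytes written. */
size_t copy_with_forecast(FILE *in, FILE *out)
{
    struct forecast f = { .in = in };
    char current[BLOCK];
    size_t total = 0, n;
    pthread_t tid;

    assert(in != NULL && out != NULL);
    pthread_mutex_init(&f.lock, NULL);
    pthread_cond_init(&f.changed, NULL);
    if (pthread_create(&tid, NULL, fill_thread, &f) != 0)
        return 0;
    do {
        pthread_mutex_lock(&f.lock);
        while (!f.ready)
            pthread_cond_wait(&f.changed, &f.lock);
        n = f.len;
        memcpy(current, f.data, n);   /* a real merge could swap pointers */
        f.ready = 0;
        pthread_cond_signal(&f.changed);
        pthread_mutex_unlock(&f.lock);

        total += fwrite(current, 1, n, out);
    } while (n > 0);
    pthread_join(tid, NULL);
    pthread_cond_destroy(&f.changed);
    pthread_mutex_destroy(&f.lock);
    return total;
}
```

The disk read for block k+1 overlaps with the processing of block k, which is the effect the forecast buffer is meant to achieve.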
3. Launching
3.1 Command line parameters
To get the list of input parameters, enter:
for_merge --help
Usage: for_merge [OPTION]... [FILE1]...
Write concatenation of sorted FILEs to file.
Options:
  -o, --output=OFILE        MANDATORY - write result to OFILE
  -s, --buffer-size=SIZE    use SIZE bytes for input buffer (2*SIZE for each input file)
  -S, --output-size=SIZE    use SIZE bytes for output buffer
  -z, --zero-terminated     end lines with 0 byte, not newline
  -h, --help                print this help
3.2 Examples
Merge 3 sorted files into OutFileName using a 10 MB input buffer size and a 50 MB output buffer:
for_merge -s 10000000 -S 50000000 --output OutFileName Input1 Input2 Input3
Merge 3 sorted files into OutFileName, let the program calculate the size of the merge buffers, and write the log to a file:
for_merge --output OutFileName Input1 Input2 Input3 > LOG.TXT 2>&1
The same as above, but also summarize system resource usage and append these statistics to the log file:
/usr/bin/time -a -o LOG.TXT -- for_merge --output OutFileName Input1 Input2 Input3 > LOG.TXT 2>&1
4. Testing
The tests are performed on 3 input files, so a 3-way merge is tested.
The “run_all.sh” script in the root directory performs the following steps:
- builds sources
- prepares input data
- runs tests and prints results on terminal
NOTE: Each test takes approximately 10-20 minutes (including the diff check)!
NOTE: The tests require approximately 20 GB of free disk space!
4.1 Generating input data files
The “prepare_all_data_files.sh” script in the “data” directory creates 2 files with sorted data:
- sorted_small_C.dat ~250 MB, 36.000.000 text lines
- sorted_big_C.dat ~5 GB, 730.000.000 text lines
Then each of these data files is divided into 3 parts with indexes “part1”, “part2”, “part3”.
As a result, the “data” directory contains the two sorted files together with the three parts of each, and the input data files are ready for testing the application.
4.2 Launching test suite
There are 3 scripts in the “test_cases” directory:
- “run_all_small.pl” – runs 4 tests with SMALL data set
- “run_all_big.pl” – runs 1 test with BIG data set
- “run_all_std.sh” – runs standard merging with SMALL and BIG data sets using GNU “sort” utility
Each of the above test cases uses the 3 corresponding (BIG or SMALL) input files (with indexes “part1”, “part2”, “part3”). After merging, a new file appears as the result (with index “result”, e.g. sorted_big_result_std.dat). In addition, a *.rep file with the time/memory consumption results is created for each test. Finally, the new merged file is compared with the original file from the “data” directory.
5. Results of testing
Here is a comparison of the timing results of our application “for_merge” and the standard GNU “sort” utility. The tests were run on an Intel Core Duo platform with 2 GB of RAM.
5.1 SMALL data set
- merging 3 sorted files ~85 MB each
- 255906560 bytes output file, 36.000.000 text lines
for_merge (2 threads)
GNU sort (1 thread)
5.2 BIG data set
- merging 3 sorted files ~1.7 GB each
- 5118131200 bytes output file, 730.000.000 text lines
for_merge (2 threads)
GNU sort (1 thread)
5.3 Conclusion
The timing results show that “for_merge” is approximately 25% faster in terms of “User time”.
“System time” is approximately equal for both applications because the sizes of the input and output files are the same, and I/O operations behave similarly on current hard drives with large cache memory.
“for_merge” merges more efficiently because it uses 2 threads in the Forecasting algorithm. On the other hand, GNU “sort” is a rather fast and efficient utility too, bearing in mind that it has been developed for years by a knowledgeable open source community.
6. TODO
- rewrite the generation of input files using the “pwgen” utility 😉
- improve I/O speed by using several hard drives, following the recommendations in Knuth, vol. 3, section 5.4.9
- experiment with different sizes of input and output buffers to improve speed, as mentioned in Knuth, vol. 3, section 5.4.9