This article demonstrates one possible method of merging N sorted large text files using Forecasting algorithm in the style suggested by D.Knuth. The demo app "for_merge" will merge them faster than standard Unix "sort" utility by starting test suite. This code also could be useful for implementation of second part of Merge-Sort algorithm (where merging takes place). Note, that Input files are generated before test suite starts.

Table of Contents
1. Introduction
2. Implementation details
2.1 Working environment
2.2 Source code overview
3. Launching
3.1 Command line parameters
3.2 Examples
4. Testing
4.1 Generating input data files
4.2 Launching test suite
5. Results of testing
5.1 SMALL data set
5.2 BIG data set
5.3 Conclusion
6. TODO

1. Introduction

External merging is a term for a class of merging algorithms that can handle massive amounts of data. External merging is required when the data being merged do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive).

P-way merge is a more general algorithm that allows to merge N pre-sorted lists into one.

Forecasting merge algorithm (P-way merge using 2P + 2 buffers) was suggested by D.Knuth in "The Art of Computer Programming" volume 3, exercise 5.4.6F. It keeps track of the buffer that will be emptied first and uses an extra buffer to read the appropriate next part from the disk, while the contents of the remaining buffers are processed.

2. Implementation details

2.1 Working environment

The article requires Linux development environment for compiling code and running tests that includes

  • Linux OS on PC platform with multiprocessing system;
  • GNU toolchains with gcc and pthread lib for building C sources.

2.2 Source code overview

Size of available memory is obtained by means of the get_memory_available() function where 1)system and 2)RLIMIT restrictions are taken into account.

The size of the available memory is used for setting optimal size of merge buffers: P input buffers and 1 output (also it can be set as parameters of the program).

The fill_input_buffer_thread() function fills forecast buffer in background.

The open_input_files() function opens the input files for reading and the output one for writing.

The main function, merge_fps(), manipulates buffers in such manner that as soon as the end of the input buffer is reached, the "forecasted buffer" is ready to use. So, there are no time lags with the next buffer preparation.

The compare() function is rather standard: it compares two lines A and B, returning negative, zero, or positive depending on whether A is less than, equal to, or greater than B.

Current implementation uses 2 threads. First one is for reading data from disk and preparing input buffer. Another one performs comparison of lines of input buffers and writes data to output buffer\disk.

It doesn't use the 3-rd thread for writing data from output buffer to disk because of very small efficiency: tests show that it gives ~2-2% of performance increase. The reasons of this situation are very big cache size of modern disks and effective synchronous operations of read-write.

3. Launching

3.1 Command line parameters

To get the list of input parameters enter

 for_merge --help

Result:

   Usage: for_merge [OPTION]... [FILE1]...
   Write  concatenation of sorted FILEs to file.
   Options:
   -o,  --output=OFILE MANDATORY - write  result to OFILE
   -s,  --buffer-size=SIZE use SIZE bytes for  input buffer (2*SIZE for each input file)
   -S,  --output-size=SIZE use SIZE bytes for  output buffer
   -z,  --zero-terminated end lines with 0  byte, not newline
   -h, --help print this help

3.2 Examples

Merge 3 sorted files into OutFileName, use 10Mb input and 50Mb output buffers:

   for_merge -s 10000000 -S 50000000 --output OutFileName Input1 Input2 Input3

Merge 3 sorted files into OutFileName, calculate size of merge buffers and put log into file:

   for_merge --output OutFileName  Input1 Input2 Input3 > LOG.TXT 2>&1

The same as above plus summarize system resource usage and write this stats to the log file:

   /usr/bin/time -a -o LOG.TXT -- for_merge  --output OutFileName Input1 Input2 Input3 >  LOG.TXT 2>&1

4. Testing

Test is performed on 3 input files, so, 3-way merge is tested.

There is "run_all.sh" script in the root, which performs such steps:

  1. builds sources
  2. prepares input data
  3. runs tests and prints results on terminal

NOTE It will take appr. 10-20 minutes per each test (including diff check)!

NOTE It will take appr. 20 Gbyte of free disk space!

4.1 Generating input data files

Script "prepare_all_data_files.sh" in "data" directory creates 2 files with sorted data:

  • sorted_small_C.dat ~250 Mb, 36.000.000 text lines
  • sorted_big_C.dat ~5 Gb, 730.000.000 text lines

Then each of these data files is divided into 3 parts with indexes "part1", "part2", "part3".

The result of this script work is a set of such new files in "data" directory:

  • sorted_big_C.dat
  • sorted_big_C_part1.dat
  • sorted_big_C_part2.dat
  • sorted_big_C_part3.dat
  • sorted_small_C.dat
  • sorted_small_C_part1.dat
  • sorted_small_C_part2.dat
  • sorted_small_C_part3.dat

So, input data files are ready to start testing application.

4.2 Launching test suite

There are 3 scripts in "test_cases" directory:

  • "run_all_small.pl" - runs 4 tests with SMALL data set
  • "run_all_big.pl" - runs 1 test with BIG data set
  • "run_all_std.sh" - runs standard merging with SMALL and BIG data sets using GNU "sort" utility

There are 3 appropriate (BIG or SMALL) input files (with indexes "part1", "part2", "part3") in each of above test case. After merging, new file appears as the result (with index "result", f.e. sorted_big_result_std.dat). Also *.rep files are created per each test with time\memory consumption results. And at the end, this new merged file is checked with the original file from "data" directory.

5. Results of testing

Here is the comparison of time test results of our application "for_merge" and standard GNU "sort" utility. Intel CoreDuo platform was used with 2G RAM.

5.1 SMALL data set

  • merging 3 sorted files ~85 Mb each
  • 255906560 bytes output file, 36.000.000 text lines

Utility

User time

System time

for_merge (2 threads)

7.55 secs

1.12 secs

GNU sort (1 thread)

10.06 secs

1.24 secs

5.2 BIG data set

  • merging 3 sorted files ~1.7 Gb each
  • 5118131200 bytes output file, 730.000.000 text lines

Utility

User time

System time

for_merge (2 threads)

154.49 secs

24.49 secs

GNU sort (1 thread)

200.05 secs

24.97 secs

5.3 –°onclusion

The obtained time test results demonstrate that "for_merge" is approximately 25% faster on "User time" parameter.

"System time" parameter is approximately equal because the sizes of input and output files are the same in both applications, and I/O operations are similar on the current hard drives with the huge cache memory.

"for_merge" shows more efficiency in merging because of using 2 threads in Forecasting algorithm. On the other hand, "GNU sort" is rather fast and efficient utility too, keeping in mind that it is developed for years by knowledgeable open source community.

6. TODO

  • rewrite generation of input files using "pwgen" utility ;-)
  • improve the speed of I/O by using several hard drives. Recommendations from Knuth in v.3, 5.4.9
  • play with different sizes of input and output buffers to improve the speed as mentioned in Knuth, v.3, 5.4.9

Download Project Source (TAR, 99 KB)

Get more Linux-related articles in our Dev Blog: How to write a device driver

Subscribe to updates