Sunday 16 February 2020

Parallel execution in shell script using GNU parallel

Occasionally, one needs to look for certain strings inside very large files.
By a very large file, I mean a text file of more than 1GB in size.
Such a file can be opened with Linux tools such as less, but navigating it and searching inside it becomes challenging.
With an even bigger file, things get harder still to handle with a single-threaded application like less.
Things become more challenging again with multiple files of that size; at that point one turns to grep and sed to search for patterns and make edits without having to load the file(s) into memory.
But even with these command-line filters, which mainly act on the input stream, a casual grep can take a long time to produce results.

To speed this up, we can use the GNU parallel command to take advantage of multiple CPU cores.
parallel runs commands in separate processes, either on the same host or across multiple hosts. Each process handles a part of the input, and parallel then collects the output of those processes and assembles the final result of the jobs.
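As a toy illustration of this split-and-collect model (the block size here is chosen artificially small just so the splitting is visible; it is not a realistic setting):

```shell
# Cut stdin into ~1KB chunks, run wc -l once per chunk, and let -k
# keep the per-chunk outputs in input order.  The printed counts
# sum to the total number of input lines (1000).
seq 1 1000 | parallel --pipe --block 1k -k wc -l
```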

A simple example for running parallel is below:

sherif@ulmo ~/logs $ < catalina.out parallel --pipe grep -C 4 'OutOfMemory'

The above command redirects catalina.out to standard input and pipes it through parallel, which runs grep -C 4 on each block to look for the string.
This form uses the default block size (1MB) and the default number of jobs (one per available CPU core, corresponding to 100%).
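Both defaults can be overridden explicitly. The sketch below generates a small sample file on the spot to stand in for a real log; the 10MB block size and 4 jobs are arbitrary example values, not tuned recommendations:

```shell
# Create a small stand-in for a large log file.
seq 1 200000 > sample.log

# Split stdin into 10MB blocks, run up to 4 grep jobs at once, and
# keep the output in input order with -k.  Each job prints its own
# match count (grep -c), so awk sums them into a single total.
< sample.log parallel --pipe --block 10M -j4 -k grep -c '7' \
    | awk '{s+=$1} END {print s}'
```

The summed total matches what a plain grep -c on the whole file would report.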

With some tuning of the parallel command we can gain more efficiency over a single-threaded filter. Below is a comparison of execution times for a standard egrep on a 2GB file versus the same search using parallel with 8 jobs and a 100MB block size:

[root@feanor ~]# time egrep -ni "^/ditto" all_files.txt |wc -l
82

real    0m5.401s
user    0m4.773s
sys     0m0.210s


[root@feanor ~]# time parallel --pipepart --joblog joblog1 -a all_files.txt --block 100M -P8 -k egrep -ni "^/ditto" |wc -l
82

real    0m2.819s
user    0m7.224s
sys     0m1.801s
[root@feanor ~]#

As you can see, parallel produced the same result in about half the time taken by a single-threaded egrep.
The fact that the user time is much higher for the parallel run shows that parallel made use of multiple CPU cores: the total CPU user time exceeds the wall-clock (real) time taken by the process.
(For more on interpreting the time command's output, see this Unix & Linux Stack Exchange thread: https://unix.stackexchange.com/questions/40694/why-real-time-can-be-lower-than-user-time)

For more information and examples for using parallel, please check the excellent GNU parallel man page: https://www.gnu.org/software/parallel/man.html
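For the multiple-files case mentioned at the start, parallel can also take the file names themselves as arguments via :::, running one command per file instead of splitting a single stream. A minimal sketch (the log file names below are hypothetical placeholders):

```shell
# Run one grep per file, up to one job per CPU core by default.
# {} is replaced by each file name and -k keeps the results in
# the same order as the arguments.
parallel -k grep -ni 'OutOfMemory' {} ::: app1.log app2.log app3.log
```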

