Using GNU Parallel to speed up nested loops in bash
Get it right ...
At $ork we have a small Perl application that uses a CSV source file to generate configuration files that are consumed by Ecelerity. The application invocation looks like this:

app.pl myconfig.csv some-binding some-day

The CSV file is the master configuration, the binding is an ecelerity thing, and the day is the day-specific configuration we're generating. The output is a wad of text that contains the binding- and day-specific configuration.
Since we need to generate configuration files for many bindings and many days, the script is usually run many times in succession in a bash script, e.g.:
for binding in $bindings; do
    for day in $days; do
        app.pl $csvfile $binding $day > $binding/$day.conf
    done
done

This solution does a good job of separating concerns; each file is generated by a single call to the helper script, which makes the script simple and easy to maintain.
... then make it fast!
The above solution was intended as a stopgap, good for a few months at most. More than a year later, the stopgap is now part of the process. Over time, the number of bindings and the size of the CSV file have increased, pushing the config generation time from about 2 minutes to about 10.
One early optimization was to limit the configuration files generated to days in the future. There's no sense in generating a config file for day 20 when it's day 21 now. This was a great improvement, but the run time crept up nonetheless.
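The day format isn't shown here, but as an illustration: if $days held plain day-of-month numbers, the list could be limited to today onward with something like the following (a hypothetical sketch; assumes GNU date and numeric days).

# only generate configs from today onward, assuming numeric days 1..31
today=$(date +%-d)        # day of month without leading zero (GNU date)
days=$(seq "$today" 31)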
Possible areas of improvement
Ten minutes is too long. What options do we have to speed it up? Ordinarily I'd want to use profiling to make sure I'm addressing the areas with the greatest opportunity, but two options present themselves based on experience with the process:

- Execution of the wrapper shell script with the nested loops is single-threaded. This is running on a 16-core machine; can we take advantage of this?
- The CSV file is increasing in size because there's old data in it. Can we remove the old data, and thus speed up the parsing?
Improvement #1: using GNU Parallel
With a little fudging of our original helper script, we can replace the nested loops with a call to GNU Parallel, which lets us easily run, say, 10 invocations of our Perl script in parallel and make better use of the system resources available to us. Here's one way:

# helper function to simplify our parallel invocation
function helper() {
    csv=$1
    binding=$2
    day=$3
    app.pl "$csv" "$binding" "$day" > "$binding/$day.conf"
}
# need to export the helper so that subshells
# invoked by parallel have access to it
export -f helper
# run 10 jobs in parallel
parallel --jobs 10 helper $csv ::: $bindings ::: $days
To debug this, I made use of parallel's --dry-run option, which prints to STDOUT the commands that parallel would run.
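For example, applied to the invocation above:

# enumerate the generated commands without running them
parallel --dry-run --jobs 10 helper $csv ::: $bindings ::: $days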
Right off the bat, the overall run time was reduced from 10 minutes to under 1 minute: a factor-of-10 speedup. Adding more jobs doesn't reduce the run time further, so some other fundamental issue must be limiting our throughput.
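One rough way to sanity-check that (a sketch, not a careful benchmark) is to time the run at a few different job counts:

# try several --jobs settings and compare wall-clock times
for n in 5 10 20 40; do
    echo "jobs=$n"
    time parallel --jobs "$n" helper $csv ::: $bindings ::: $days
done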
Improvement #2: CSV source size reduction
Our CSV "spreadsheet" contains a large number of columns (one per day). One existing optimization, mentioned earlier, is not to regenerate config files for days in the past. Unfortunately, the cost of parsing the ever-larger file is catching up to us.

I put together a robust and fast column-pruning script in Perl. It runs once, up front, and creates a temporary CSV file pruned to about 1/4 the size of the original. With the pruned file as input, the parallelized version of the throttle generation code finishes in under 30 seconds.
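The Perl script itself isn't reproduced here, but the idea can be sketched in a few lines of awk. This is a minimal illustration under a big assumption: a simple CSV with no quoted fields or embedded commas. It drops every column whose header matches the regex passed as the first argument:

#!/usr/bin/env bash
# prune_columns.sh -- hypothetical sketch of the column-pruning idea;
# drops every column whose header matches the regex in $1
pat=$1
awk -F',' -v OFS=',' -v pat="$pat" '
NR == 1 {
    # remember the indexes of the columns we keep
    for (i = 1; i <= NF; i++)
        if ($i !~ pat) keep[++n] = i
}
{
    line = ""
    for (j = 1; j <= n; j++)
        line = (j == 1 ? $(keep[j]) : line OFS $(keep[j]))
    print line
}'

Invoked as, say, ./prune_columns.sh '^Day (1|2|3)$' < master.csv > pruned.csv.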
Not too shabby: a 95% reduction in overall run time!
Future directions
Some other changes suggest themselves:

- Maybe cache the parse? This is hard to justify without some profiling.
- Improve the CSV pruning script for ease of use. As currently implemented, the pruning script identifies the columns to remove via regular expressions. A less general but more useful interface would just remove all columns that look like "Day N", where N is some argument to the script; a sketch of that interface follows.
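As a purely hypothetical sketch of that interface, reusing the prune_columns.sh sketch from above: build the alternation for "Day 1" through "Day N" and hand it to the regex-based pruner.

# hypothetical wrapper: drop columns "Day 1" through "Day N",
# where N is the first argument (prune_columns.sh is the sketch above)
N=$1
./prune_columns.sh "^Day ($(seq -s '|' 1 "$N"))\$" < master.csv > pruned.csv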