Performance drop on multiple file inputs
When running PaSh on a grep pipeline over two files of different sizes, the speedup is sub-optimal, specifically at a parallelization width of 2.
For example, the "Two unequal sizes" row of the table shows, at width 2, the speedup of PaSh on a ~1 MB file and a ~64 MB file. The "Combined one file" row shows the speedup after cat-ing the same two unequal-sized files into a single ~65 MB file. Oddly, this performance loss largely diminishes at width 4 and disappears at width 8 (see the Width=4 and Width=8 entries of the table).
Times are in seconds, averaged over 5 repetitions. Cell format: sequential time / PaSh time (speedupX).
| Input | Width=2 | Width=4 | Width=8 |
|---|---|---|---|
| One file, ~64 MB | 26.9076 / 14.1222 (1.90534X) | 26.179 / 9.0142 (2.9042X) | 26.1724 / 7.7974 (3.35655X) |
| Two unequal sizes, ~1 MB and ~64 MB | 27.2162 / 26.8302 (1.0144X) | 26.526 / 9.1956 (2.88464X) | 26.7292 / 7.8822 (3.3911X) |
| Combined one file, ~65 MB | 26.684 / 14.3274 (1.8624X) | 26.5346 / 8.9382 (2.96867X) | 26.4956 / 7.881 (3.36196X) |
| Two equal sizes, ~64 MB each | 53.8 / 27.355 (1.9667X) | 53.0384 / 17.8858 (2.96539X) | 53.043 / 15.3572 (3.45395X) |
Command used for evaluation: `cat fileA fileB | grep '(.).\1(.).\2(.).\3(.).\4' | wc -l`
Code used for evaluation (width=2): https://drive.google.com/file/d/1aGOYvXawq4axEAxXj9EAO0HJuHsq_6mv/view?usp=sharing
Current hypothesis: the new round-robin split assumes that all files in the input stream are the same size and partitions the work across workers accordingly. However, what would explain the performance loss disappearing at higher widths?
Thanks for posting that. PaSh currently does not take into account the size of input files during optimization.
Since `split` is not free (it needs to do a pass over the data), PaSh does not add a `split` after a `cat` if the number of input files matches the width. Therefore, with two input files and width = 2, PaSh sends the 1 MB file to one parallel copy of `grep` and the 64 MB file to the other one. You can see this by running PaSh with `-d 1` and inspecting the parallel script that it produces.
Regarding widths 4 and 8, PaSh adds a `split` after the `cat` because the width does not match the number of input files, which leads to balanced work across the parallel copies of the `grep` command.
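The width-2 numbers can be sanity-checked with a little arithmetic. The sketch below assumes that `grep` time is roughly proportional to input size and that each file goes to its own worker with no `split`, so the largest file sets the finish time. The `ideal_speedup` helper is a made-up name for illustration and is not part of PaSh.

```shell
#!/bin/sh
# Hypothetical helper (not part of PaSh): upper bound on speedup when each
# input file is handed to its own worker and no split is inserted.
# Assumes run time is proportional to file size, so the slowest worker
# (the one with the largest file) determines the total time.
# Arguments: one size per file, in any consistent unit (here: MB).
ideal_speedup() {
    printf '%s\n' "$@" | awk '
        { total += $1; if ($1 > max) max = $1 }
        END { printf "%.4f\n", total / max }'
}

ideal_speedup 1 64    # unequal files, width 2: prints 1.0156
ideal_speedup 64 64   # equal files, width 2:   prints 2.0000
```

Under these assumptions the bounds are close to the measured 1.0144X and 1.9667X in the width-2 column, which is consistent with the 64 MB file being processed by a single copy of `grep`.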
I hope that answers the question.
I see, that makes sense. Thanks for the clarification!
Would it make sense then for PaSh to first check the sizes of the input files (maybe through metadata) and, if they are drastically different (i.e. the ratio of the file sizes is over some threshold), `cat` and then `split` them, even if the number of input files corresponds with the width?
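As a rough illustration of what such a check could look like outside of PaSh, here is a minimal sketch. The function name `sizes_are_skewed` and the threshold argument are assumptions made for this example. It is not how PaSh decides where to place a `split`.

```shell
#!/bin/sh
# Hypothetical helper, not part of PaSh: exit 0 when the largest input file
# is more than THRESHOLD times the size of the smallest one, i.e. when
# forcing a split after cat is probably worth its extra pass over the data.
# Usage: sizes_are_skewed THRESHOLD FILE FILE...
sizes_are_skewed() {
    threshold=$1
    shift
    [ "$#" -ge 2 ] || return 2          # need at least two files to compare
    min='' max=0
    for f in "$@"; do
        size=$(wc -c < "$f") || return 2
        size=$((size + 0))              # drop padding some wc versions print
        if [ "$size" -gt "$max" ]; then max=$size; fi
        if [ -z "$min" ] || [ "$size" -lt "$min" ]; then min=$size; fi
    done
    # Integer comparison max > threshold * min avoids dividing by zero
    # when one of the files is empty.
    [ "$max" -gt $((threshold * min)) ]
}
```

A wrapper script could call `sizes_are_skewed 10 fileA fileB` and, when it succeeds, concatenate the inputs into one file before invoking PaSh. That mirrors the "Combined one file" row of the table, which recovered the width-2 speedup. The threshold of 10 is arbitrary and would need tuning against the cost of the extra pass.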