Performance drop on multiple file inputs

Open DanielSongShen opened this issue 3 years ago • 2 comments

When running pash on a grep pipeline over two files of different sizes, the speedup is sub-optimal, specifically with a parallelization width of 2.

For example, row 2, column 1 of the table shows the speedup of pash on a 1 MB file and a 64 MB file. Meanwhile, row 3, column 1 shows the speedup of pash after cat-ing the two unequal-sized files into a single 65 MB file. Weirdly, this performance loss diminishes heavily at width 4 and disappears at width 8 (see columns 2 and 3 of the table).

Averaged over 5 repetitions. Time in seconds. Formatting: sequential time / pash time (speedup).

| Input | Width=2 | Width=4 | Width=8 |
| --- | --- | --- | --- |
| One file, ~64 MB | 26.9076 / 14.1222 (1.90534X) | 26.179 / 9.0142 (2.9042X) | 26.1724 / 7.7974 (3.35655X) |
| Two unequal sizes, ~1 MB and ~64 MB | 27.2162 / 26.8302 (1.0144X) | 26.526 / 9.1956 (2.88464X) | 26.7292 / 7.8822 (3.3911X) |
| Combined one file, ~65 MB | 26.684 / 14.3274 (1.8624X) | 26.5346 / 8.9382 (2.96867X) | 26.4956 / 7.881 (3.36196X) |
| Two equal sizes, ~64 MB each | 53.8 / 27.355 (1.9667X) | 53.0384 / 17.8858 (2.96539X) | 53.043 / 15.3572 (3.45395X) |

Command used for evaluation: cat fileA fileB | grep '(.).\1(.).\2(.).\3(.).\4' | wc -l

Code used for evaluation (width=2): https://drive.google.com/file/d/1aGOYvXawq4axEAxXj9EAO0HJuHsq_6mv/view?usp=sharing

Current hypothesis: The new round-robin split assumes both/all files in the input stream are the same size and partitions workers accordingly. However, what would explain the performance loss disappearing on higher widths?

DanielSongShen, Jul 13 '21 19:07

Thanks for posting that. PaSh currently does not take into account the size of input files during optimization.

Since split is not free (it needs to do a pass over the data), PaSh does not add a split after a cat if the number of input files equals the width. Therefore, with two input files and width = 2, PaSh sends the 1 MB file to one parallel copy of grep and the 64 MB file to the other. You can see this by running PaSh with -d 1 and inspecting the parallel script that it produces.

Regarding widths 4 and 8, PaSh will add a split after the cat since width does not correspond to the number of input files, therefore leading to balanced work on the parallel copies of the grep command.
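
To make the numbers in the table easier to reason about, here is a toy model of the rule described above. It is not PaSh's actual code: `plan_chunks` and `modeled_speedup` are hypothetical names, and the model assumes that grep's running time is proportional to the amount of input and that split itself costs nothing.

```python
def plan_chunks(file_sizes_mb, width):
    """Return how much data (in MB) each parallel grep copy receives.

    Toy version of the rule explained in this thread (an assumption,
    not PaSh's implementation): if the number of input files equals
    the width, each file goes to its own copy and no split is added;
    otherwise the concatenated stream is split evenly across copies.
    """
    if len(file_sizes_mb) == width:
        return list(file_sizes_mb)
    total = sum(file_sizes_mb)
    return [total / width] * width


def modeled_speedup(file_sizes_mb, width):
    """Ideal speedup when time is proportional to data processed.

    The parallel run finishes only when the most loaded copy finishes,
    so the speedup is total work divided by the largest chunk.
    """
    chunks = plan_chunks(file_sizes_mb, width)
    return sum(file_sizes_mb) / max(chunks)
```

Under these assumptions, `modeled_speedup([1, 64], 2)` is 65/64, roughly 1.016, which is close to the measured 1.0144X: one copy of grep does almost all the work. With width 4 the model splits the stream evenly and predicts an ideal speedup of 4, an upper bound that the measured 2.88X falls short of, presumably because of split and other overheads that the model ignores.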

I hope that answers the question.

angelhof, Jul 13 '21 20:07

I see, that makes sense. Thanks for the clarification!

Would it make sense then for pash to first check the sizes of the input files (maybe through metadata) and if they are drastically different (i.e. the ratio of the file sizes is over some threshold), cat and then split them, even if the number of input files corresponds with the width?
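
A minimal sketch of the check being proposed, to make the idea concrete. Everything here is an assumption for illustration: the function names, the default threshold of 4.0, and the use of `os.path.getsize` for file metadata are not part of PaSh.

```python
import os


def sizes_are_skewed(sizes, ratio_threshold=4.0):
    """Return True if the largest input is more than ratio_threshold
    times the smallest one (hypothetical threshold, chosen arbitrarily)."""
    smallest, largest = min(sizes), max(sizes)
    if smallest == 0:
        # An empty file next to a non-empty one is maximally skewed.
        return largest > 0
    return largest / smallest > ratio_threshold


def should_split_after_cat(paths, width, ratio_threshold=4.0):
    """Decide whether to add a split after the cat.

    Keeps the rule described earlier in this thread (split when the
    number of files differs from the width) and additionally splits
    when the file sizes are drastically different.
    """
    if len(paths) != width:
        return True
    sizes = [os.path.getsize(p) for p in paths]  # size from file metadata
    return sizes_are_skewed(sizes, ratio_threshold)
```

For the 1 MB and 64 MB case with width 2, the ratio is 64, so this sketch would choose to cat and split. Two files of about 64 MB each have a ratio near 1 and would still be sent to separate copies of grep without a split. Note that size metadata is only meaningful for regular files, so inputs such as pipes would need a fallback.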

DanielSongShen, Jul 15 '21 19:07