parquet-go

Improve throughput of file merges

Open achille-roussel opened this issue 2 years ago • 2 comments

Merging Parquet files is a critical operation for the way we use this library. It is heavy on both the read and write paths, since files must be decoded and then re-encoded with the new content (e.g. rows may be re-ordered).
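To make the re-ordering step concrete, here is a minimal sketch of a k-way merge over sorted inputs, assuming each decoded row can be reduced to an `int64` sort key. This is not parquet-go's implementation; `cursor`, `cursorHeap`, and `mergeSorted` are hypothetical names used only for illustration, and the sketch uses only the standard `container/heap` package.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor tracks the read position within one sorted input.
// The int64 values stand in for decoded rows (an assumption of this sketch).
type cursor struct {
	rows []int64
	pos  int
}

// cursorHeap is a min-heap of cursors ordered by their current row.
type cursorHeap []*cursor

func (h cursorHeap) Len() int           { return len(h) }
func (h cursorHeap) Less(i, j int) bool { return h[i].rows[h[i].pos] < h[j].rows[h[j].pos] }
func (h cursorHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeSorted merges any number of individually sorted inputs
// into a single sorted output.
func mergeSorted(inputs ...[]int64) []int64 {
	h := make(cursorHeap, 0, len(inputs))
	total := 0
	for _, in := range inputs {
		if len(in) > 0 { // empty inputs contribute nothing
			h = append(h, &cursor{rows: in})
			total += len(in)
		}
	}
	heap.Init(&h)

	out := make([]int64, 0, total)
	for h.Len() > 0 {
		c := h[0] // cursor holding the smallest current row
		out = append(out, c.rows[c.pos])
		c.pos++
		if c.pos == len(c.rows) {
			heap.Pop(&h) // input exhausted: drop its cursor
		} else {
			heap.Fix(&h, 0) // restore heap order after advancing
		}
	}
	return out
}

func main() {
	a := []int64{1, 4, 9}
	b := []int64{2, 3, 10}
	fmt.Println(mergeSorted(a, b)) // [1 2 3 4 9 10]
}
```

Every output row costs one heap adjustment on top of the decoding work, which is one reason per-row overhead on the read path weighs so heavily on merge throughput.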

While we have invested in improving the write path, the algorithms on the read path are still naive, yielding throughput well below what we need for production workloads.

This issue tracks progress on improving read throughput, especially in the context of merging Parquet files.

achille-roussel, Jun 09 '22 05:06

+1, improving performance on the read path would be very useful for us as well. Some of our profiles show that ~60-70% of CPU on our compactors is spent reading and reconstructing rows.

annanay25, Jun 09 '22 08:06

Progress so far:

name                                                 old time/op  new time/op  delta
MergeFiles/BOOLEAN/groups=2,rows=20000               17.0µs ± 1%  16.8µs ± 1%    -0.87%  (p=0.001 n=8+10)
MergeFiles/INT32/groups=2,rows=20000                 57.5µs ± 0%   6.9µs ± 1%   -87.93%  (p=0.000 n=8+8)
MergeFiles/INT64/groups=2,rows=20000                 6.41µs ± 0%  6.43µs ± 1%      ~     (p=1.000 n=9+8)
MergeFiles/INT96/groups=2,rows=20000                 33.5µs ±13%  36.4µs ±26%      ~     (p=0.393 n=10+10)
MergeFiles/FLOAT/groups=2,rows=20000                 6.21µs ± 1%  6.29µs ± 1%    +1.23%  (p=0.000 n=9+10)
MergeFiles/DOUBLE/groups=2,rows=20000                6.61µs ± 1%  6.65µs ± 1%    +0.65%  (p=0.010 n=9+10)
MergeFiles/BYTE_ARRAY/groups=2,rows=20000            46.0µs ± 1%  18.0µs ± 0%   -60.95%  (p=0.000 n=10+8)
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000  7.67µs ± 5%  7.81µs ± 5%      ~     (p=0.148 n=10+10)
MergeFiles/STRING/groups=2,rows=20000                20.4µs ± 2%  11.1µs ± 0%   -45.73%  (p=0.000 n=8+9)
MergeFiles/STRING_(dict)/groups=2,rows=20000         20.1µs ± 0%  19.2µs ± 0%    -4.08%  (p=0.000 n=10+8)
MergeFiles/UUID/groups=2,rows=20000                  41.5µs ± 0%  16.0µs ± 2%   -61.46%  (p=0.000 n=8+9)
MergeFiles/DECIMAL/groups=2,rows=20000               6.44µs ± 0%  6.46µs ± 0%    +0.39%  (p=0.000 n=9+9)
MergeFiles/AddressBook/groups=2,rows=20000            233µs ± 1%   219µs ± 0%    -6.36%  (p=0.000 n=8+8)
MergeFiles/one_optional_level/groups=2,rows=20000    23.0µs ±11%  13.1µs ± 1%   -43.10%  (p=0.000 n=10+10)

name                                                 old row/s    new row/s    delta
MergeFiles/BOOLEAN/groups=2,rows=20000                57.0M ± 1%   57.5M ± 1%    +0.88%  (p=0.001 n=8+10)
MergeFiles/INT32/groups=2,rows=20000                  16.8M ± 0%  139.4M ± 1%  +728.57%  (p=0.000 n=8+8)
MergeFiles/INT64/groups=2,rows=20000                   151M ± 0%    151M ± 1%      ~     (p=1.000 n=9+8)
MergeFiles/INT96/groups=2,rows=20000                  29.1M ±15%   27.4M ±26%      ~     (p=0.393 n=10+10)
MergeFiles/FLOAT/groups=2,rows=20000                   156M ± 1%    154M ± 1%    -1.22%  (p=0.000 n=9+10)
MergeFiles/DOUBLE/groups=2,rows=20000                  146M ± 1%    145M ± 1%    -0.64%  (p=0.010 n=9+10)
MergeFiles/BYTE_ARRAY/groups=2,rows=20000             21.0M ± 1%   53.9M ± 0%  +156.05%  (p=0.000 n=10+8)
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000    126M ± 5%    124M ± 5%      ~     (p=0.143 n=10+10)
MergeFiles/STRING/groups=2,rows=20000                 47.5M ± 2%   87.5M ± 0%   +84.24%  (p=0.000 n=8+9)
MergeFiles/STRING_(dict)/groups=2,rows=20000          48.3M ± 0%   50.3M ± 0%    +4.25%  (p=0.000 n=10+8)
MergeFiles/UUID/groups=2,rows=20000                   23.3M ± 0%   60.5M ± 2%  +159.49%  (p=0.000 n=8+9)
MergeFiles/DECIMAL/groups=2,rows=20000                 150M ± 0%    150M ± 0%    -0.34%  (p=0.001 n=9+8)
MergeFiles/AddressBook/groups=2,rows=20000            4.15M ± 1%   4.43M ± 0%    +6.79%  (p=0.000 n=8+8)
MergeFiles/one_optional_level/groups=2,rows=20000     42.2M ±10%   73.9M ± 1%   +75.25%  (p=0.000 n=10+10)
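As a reading aid for the delta columns, the sketch below recomputes the relative change for the INT32 rows from the rounded values shown in the tables above. The helper names `percentDelta` and `speedup` are hypothetical. Because the inputs are the rounded means rather than the raw samples, the results should land close to, but not exactly on, the reported -87.93% and +728.57%.

```go
package main

import "fmt"

// percentDelta returns the relative change from before to after, in percent.
// Negative is better for time/op, positive is better for row/s.
func percentDelta(before, after float64) float64 {
	return (after - before) / before * 100
}

// speedup returns how many times faster the new time/op is than the old one.
func speedup(oldTime, newTime float64) float64 {
	return oldTime / newTime
}

func main() {
	// Rounded INT32 values copied from the tables above.
	fmt.Printf("time/op delta: %+.1f%%\n", percentDelta(57.5, 6.9))
	fmt.Printf("row/s delta:   %+.1f%%\n", percentDelta(16.8, 139.4))
	fmt.Printf("speedup:       %.1fx\n", speedup(57.5, 6.9))
}
```

Note that the two deltas are not symmetric: a time reduction of roughly 88% corresponds to a throughput increase of roughly 8x, which is why the row/s column shows much larger percentages than the time/op column.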

achille-roussel, Jun 14 '22 18:06

Further work on this issue is starting to yield diminishing returns, so I'm going to close it and open follow-ups once we have more data to decide which changes to focus on next.

Snapshot of the benchmark results from 5bd5f6114638a749b9326aaf4ea5a6ea90cc9cf4:

name                                                     time/op
MergeFiles/BOOLEAN/groups=2,rows=20000                   17.0µs ± 0%
MergeFiles/INT32/groups=2,rows=20000                     7.26µs ± 0%
MergeFiles/INT64/groups=2,rows=20000                     7.82µs ± 0%
MergeFiles/INT96/groups=2,rows=20000                     32.1µs ± 1%
MergeFiles/FLOAT/groups=2,rows=20000                     6.34µs ± 1%
MergeFiles/DOUBLE/groups=2,rows=20000                    6.77µs ± 1%
MergeFiles/BYTE_ARRAY/groups=2,rows=20000                11.9µs ± 2%
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000      7.59µs ± 1%
MergeFiles/STRING/groups=2,rows=20000                    8.16µs ± 0%
MergeFiles/STRING_(dict)/groups=2,rows=20000             7.21µs ± 3%
MergeFiles/UUID/groups=2,rows=20000                      15.9µs ± 0%
MergeFiles/DECIMAL/groups=2,rows=20000                   6.78µs ± 0%
MergeFiles/AddressBook/groups=2,rows=20000                176µs ± 0%
MergeFiles/one_optional_level/groups=2,rows=20000        10.5µs ± 0%
MergeFiles/one_repeated_level/groups=2,rows=20000        92.2µs ± 0%
MergeFiles/two_repeated_levels/groups=2,rows=20000       97.2µs ± 3%
MergeFiles/three_repeated_levels/groups=2,rows=20000     97.5µs ± 1%
MergeFiles/nested_lists/groups=2,rows=20000               227µs ± 1%
MergeFiles/key-value_pairs/groups=2,rows=20000            253µs ± 1%
MergeFiles/multiple_key-value_pairs/groups=2,rows=20000   747µs ± 2%
MergeFiles/repeated_key-value_pairs/groups=2,rows=20000   258µs ± 1%
MergeFiles/map_of_repeated_values/groups=2,rows=20000     101µs ± 2%

name                                                     row/s
MergeFiles/BOOLEAN/groups=2,rows=20000                    56.9M ± 1%
MergeFiles/INT32/groups=2,rows=20000                       133M ± 0%
MergeFiles/INT64/groups=2,rows=20000                       124M ± 0%
MergeFiles/INT96/groups=2,rows=20000                      30.1M ± 1%
MergeFiles/FLOAT/groups=2,rows=20000                       153M ± 1%
MergeFiles/DOUBLE/groups=2,rows=20000                      143M ± 1%
MergeFiles/BYTE_ARRAY/groups=2,rows=20000                 81.4M ± 2%
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000        128M ± 1%
MergeFiles/STRING/groups=2,rows=20000                      119M ± 0%
MergeFiles/STRING_(dict)/groups=2,rows=20000               134M ± 3%
MergeFiles/UUID/groups=2,rows=20000                       59.0M ± 0%
MergeFiles/DECIMAL/groups=2,rows=20000                     143M ± 0%
MergeFiles/AddressBook/groups=2,rows=20000                2.07M ± 0%
MergeFiles/one_optional_level/groups=2,rows=20000         92.6M ± 0%
MergeFiles/one_repeated_level/groups=2,rows=20000         7.16M ± 0%
MergeFiles/two_repeated_levels/groups=2,rows=20000        2.06M ± 3%
MergeFiles/three_repeated_levels/groups=2,rows=20000      2.05M ± 1%
MergeFiles/nested_lists/groups=2,rows=20000               1.32M ± 1%
MergeFiles/key-value_pairs/groups=2,rows=20000            3.40M ± 1%
MergeFiles/multiple_key-value_pairs/groups=2,rows=20000   1.15M ± 2%
MergeFiles/repeated_key-value_pairs/groups=2,rows=20000   1.39M ± 1%
MergeFiles/map_of_repeated_values/groups=2,rows=20000     1.66M ± 2%

name                                                     alloc/op
MergeFiles/BOOLEAN/groups=2,rows=20000                     191B ± 0%
MergeFiles/INT32/groups=2,rows=20000                       190B ± 0%
MergeFiles/INT64/groups=2,rows=20000                       190B ± 0%
MergeFiles/INT96/groups=2,rows=20000                     16.5kB ± 1%
MergeFiles/FLOAT/groups=2,rows=20000                       190B ± 0%
MergeFiles/DOUBLE/groups=2,rows=20000                      190B ± 0%
MergeFiles/BYTE_ARRAY/groups=2,rows=20000                  192B ± 0%
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000        191B ± 0%
MergeFiles/STRING/groups=2,rows=20000                      192B ± 0%
MergeFiles/STRING_(dict)/groups=2,rows=20000             4.18kB ± 0%
MergeFiles/UUID/groups=2,rows=20000                        198B ± 0%
MergeFiles/DECIMAL/groups=2,rows=20000                     190B ± 0%
MergeFiles/AddressBook/groups=2,rows=20000               1.72kB ± 2%
MergeFiles/one_optional_level/groups=2,rows=20000          197B ± 0%
MergeFiles/one_repeated_level/groups=2,rows=20000        75.3kB ± 0%
MergeFiles/two_repeated_levels/groups=2,rows=20000       62.3kB ± 0%
MergeFiles/three_repeated_levels/groups=2,rows=20000     62.3kB ± 0%
MergeFiles/nested_lists/groups=2,rows=20000               180kB ± 1%
MergeFiles/key-value_pairs/groups=2,rows=20000            186kB ± 0%
MergeFiles/multiple_key-value_pairs/groups=2,rows=20000   441kB ± 0%
MergeFiles/repeated_key-value_pairs/groups=2,rows=20000   158kB ± 1%
MergeFiles/map_of_repeated_values/groups=2,rows=20000    37.2kB ± 1%

name                                                     allocs/op
MergeFiles/BOOLEAN/groups=2,rows=20000                     0.00
MergeFiles/INT32/groups=2,rows=20000                       0.00
MergeFiles/INT64/groups=2,rows=20000                       0.00
MergeFiles/INT96/groups=2,rows=20000                        968 ± 0%
MergeFiles/FLOAT/groups=2,rows=20000                       0.00
MergeFiles/DOUBLE/groups=2,rows=20000                      0.00
MergeFiles/BYTE_ARRAY/groups=2,rows=20000                  0.00
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000        0.00
MergeFiles/STRING/groups=2,rows=20000                      0.00
MergeFiles/STRING_(dict)/groups=2,rows=20000               1.00 ± 0%
MergeFiles/UUID/groups=2,rows=20000                        1.00 ± 0%
MergeFiles/DECIMAL/groups=2,rows=20000                     0.00
MergeFiles/AddressBook/groups=2,rows=20000                 15.0 ± 0%
MergeFiles/one_optional_level/groups=2,rows=20000          1.00 ± 0%
MergeFiles/one_repeated_level/groups=2,rows=20000          41.0 ± 0%
MergeFiles/two_repeated_levels/groups=2,rows=20000         46.0 ± 0%
MergeFiles/three_repeated_levels/groups=2,rows=20000       46.0 ± 0%
MergeFiles/nested_lists/groups=2,rows=20000                90.0 ± 0%
MergeFiles/key-value_pairs/groups=2,rows=20000             83.0 ± 0%
MergeFiles/multiple_key-value_pairs/groups=2,rows=20000     230 ± 0%
MergeFiles/repeated_key-value_pairs/groups=2,rows=20000    73.0 ± 0%
MergeFiles/map_of_repeated_values/groups=2,rows=20000      26.0 ± 0%

achille-roussel, Sep 06 '22 16:09