parquet-go
Improve throughput of file merges
Merging parquet files is a critical operation for the way we use this library. It is heavy on both the read and write paths, because files must be decoded and then re-encoded with the new content (e.g. rows may be re-ordered).
While we have invested in improving the write path, the algorithms on the read path are still naive, and their throughput is well below what we need for production workloads.
This issue tracks progress on improving read throughput, especially in the context of merging parquet files.
+1, improving performance on the read path would be very useful for us as well. Some of our profiles show that ~60-70% of the CPU time on our compactors is spent reading and reconstructing rows.
Progress so far:
name                                                 old time/op  new time/op  delta
MergeFiles/BOOLEAN/groups=2,rows=20000               17.0µs ± 1%  16.8µs ± 1%  -0.87%    (p=0.001 n=8+10)
MergeFiles/INT32/groups=2,rows=20000                 57.5µs ± 0%  6.9µs ± 1%   -87.93%   (p=0.000 n=8+8)
MergeFiles/INT64/groups=2,rows=20000                 6.41µs ± 0%  6.43µs ± 1%  ~         (p=1.000 n=9+8)
MergeFiles/INT96/groups=2,rows=20000                 33.5µs ±13%  36.4µs ±26%  ~         (p=0.393 n=10+10)
MergeFiles/FLOAT/groups=2,rows=20000                 6.21µs ± 1%  6.29µs ± 1%  +1.23%    (p=0.000 n=9+10)
MergeFiles/DOUBLE/groups=2,rows=20000                6.61µs ± 1%  6.65µs ± 1%  +0.65%    (p=0.010 n=9+10)
MergeFiles/BYTE_ARRAY/groups=2,rows=20000            46.0µs ± 1%  18.0µs ± 0%  -60.95%   (p=0.000 n=10+8)
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000  7.67µs ± 5%  7.81µs ± 5%  ~         (p=0.148 n=10+10)
MergeFiles/STRING/groups=2,rows=20000                20.4µs ± 2%  11.1µs ± 0%  -45.73%   (p=0.000 n=8+9)
MergeFiles/STRING_(dict)/groups=2,rows=20000         20.1µs ± 0%  19.2µs ± 0%  -4.08%    (p=0.000 n=10+8)
MergeFiles/UUID/groups=2,rows=20000                  41.5µs ± 0%  16.0µs ± 2%  -61.46%   (p=0.000 n=8+9)
MergeFiles/DECIMAL/groups=2,rows=20000               6.44µs ± 0%  6.46µs ± 0%  +0.39%    (p=0.000 n=9+9)
MergeFiles/AddressBook/groups=2,rows=20000           233µs ± 1%   219µs ± 0%   -6.36%    (p=0.000 n=8+8)
MergeFiles/one_optional_level/groups=2,rows=20000    23.0µs ±11%  13.1µs ± 1%  -43.10%   (p=0.000 n=10+10)
name                                                 old row/s    new row/s    delta
MergeFiles/BOOLEAN/groups=2,rows=20000               57.0M ± 1%   57.5M ± 1%   +0.88%    (p=0.001 n=8+10)
MergeFiles/INT32/groups=2,rows=20000                 16.8M ± 0%   139.4M ± 1%  +728.57%  (p=0.000 n=8+8)
MergeFiles/INT64/groups=2,rows=20000                 151M ± 0%    151M ± 1%    ~         (p=1.000 n=9+8)
MergeFiles/INT96/groups=2,rows=20000                 29.1M ±15%   27.4M ±26%   ~         (p=0.393 n=10+10)
MergeFiles/FLOAT/groups=2,rows=20000                 156M ± 1%    154M ± 1%    -1.22%    (p=0.000 n=9+10)
MergeFiles/DOUBLE/groups=2,rows=20000                146M ± 1%    145M ± 1%    -0.64%    (p=0.010 n=9+10)
MergeFiles/BYTE_ARRAY/groups=2,rows=20000            21.0M ± 1%   53.9M ± 0%   +156.05%  (p=0.000 n=10+8)
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000  126M ± 5%    124M ± 5%    ~         (p=0.143 n=10+10)
MergeFiles/STRING/groups=2,rows=20000                47.5M ± 2%   87.5M ± 0%   +84.24%   (p=0.000 n=8+9)
MergeFiles/STRING_(dict)/groups=2,rows=20000         48.3M ± 0%   50.3M ± 0%   +4.25%    (p=0.000 n=10+8)
MergeFiles/UUID/groups=2,rows=20000                  23.3M ± 0%   60.5M ± 2%   +159.49%  (p=0.000 n=8+9)
MergeFiles/DECIMAL/groups=2,rows=20000               150M ± 0%    150M ± 0%    -0.34%    (p=0.001 n=9+8)
MergeFiles/AddressBook/groups=2,rows=20000           4.15M ± 1%   4.43M ± 0%   +6.79%    (p=0.000 n=8+8)
MergeFiles/one_optional_level/groups=2,rows=20000    42.2M ±10%   73.9M ± 1%   +75.25%   (p=0.000 n=10+10)
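For readers wondering where a `row/s` column can come from: it is not one of Go's built-in benchmark units. One way to produce such a unit is `testing.B.ReportMetric`. The sketch below assumes that approach, which may differ from what the MergeFiles benchmarks actually do. `copyRows` and `benchmarkRows` are hypothetical stand-ins for the measured work.

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// copyRows is a hypothetical stand-in for the operation under test.
// It returns the number of rows it processed.
func copyRows(dst, src []int64) int { return copy(dst, src) }

// benchmarkRows reports a custom "row/s" metric alongside the
// standard time/op, alloc/op and allocs/op measurements.
func benchmarkRows(b *testing.B) {
	src := make([]int64, 1000)
	dst := make([]int64, len(src))

	b.ReportAllocs()
	b.ResetTimer() // exclude setup; this also clears earlier custom metrics

	start := time.Now()
	rows := 0
	for i := 0; i < b.N; i++ {
		rows += copyRows(dst, src)
	}
	elapsed := time.Since(start).Seconds()

	// Guard against a zero reading on platforms with a coarse clock.
	if elapsed > 0 {
		b.ReportMetric(float64(rows)/elapsed, "row/s")
	}
}

func main() {
	// testing.Benchmark runs the function outside of `go test`.
	result := testing.Benchmark(benchmarkRows)
	fmt.Printf("%.0f row/s, %d allocs/op\n", result.Extra["row/s"], result.AllocsPerOp())
}
```

When a function like this runs as a regular `BenchmarkXxx` under `go test -bench`, the custom unit appears in the output next to `ns/op`, and benchstat reports it as its own table.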
Further work on this issue is starting to yield diminishing returns, so I'm going to close it and open follow-ups once we have more data to decide which changes to focus on next.
Snapshot of the benchmark results from 5bd5f6114638a749b9326aaf4ea5a6ea90cc9cf4:
name time/op
MergeFiles/BOOLEAN/groups=2,rows=20000 17.0µs ± 0%
MergeFiles/INT32/groups=2,rows=20000 7.26µs ± 0%
MergeFiles/INT64/groups=2,rows=20000 7.82µs ± 0%
MergeFiles/INT96/groups=2,rows=20000 32.1µs ± 1%
MergeFiles/FLOAT/groups=2,rows=20000 6.34µs ± 1%
MergeFiles/DOUBLE/groups=2,rows=20000 6.77µs ± 1%
MergeFiles/BYTE_ARRAY/groups=2,rows=20000 11.9µs ± 2%
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000 7.59µs ± 1%
MergeFiles/STRING/groups=2,rows=20000 8.16µs ± 0%
MergeFiles/STRING_(dict)/groups=2,rows=20000 7.21µs ± 3%
MergeFiles/UUID/groups=2,rows=20000 15.9µs ± 0%
MergeFiles/DECIMAL/groups=2,rows=20000 6.78µs ± 0%
MergeFiles/AddressBook/groups=2,rows=20000 176µs ± 0%
MergeFiles/one_optional_level/groups=2,rows=20000 10.5µs ± 0%
MergeFiles/one_repeated_level/groups=2,rows=20000 92.2µs ± 0%
MergeFiles/two_repeated_levels/groups=2,rows=20000 97.2µs ± 3%
MergeFiles/three_repeated_levels/groups=2,rows=20000 97.5µs ± 1%
MergeFiles/nested_lists/groups=2,rows=20000 227µs ± 1%
MergeFiles/key-value_pairs/groups=2,rows=20000 253µs ± 1%
MergeFiles/multiple_key-value_pairs/groups=2,rows=20000 747µs ± 2%
MergeFiles/repeated_key-value_pairs/groups=2,rows=20000 258µs ± 1%
MergeFiles/map_of_repeated_values/groups=2,rows=20000 101µs ± 2%
name row/s
MergeFiles/BOOLEAN/groups=2,rows=20000 56.9M ± 1%
MergeFiles/INT32/groups=2,rows=20000 133M ± 0%
MergeFiles/INT64/groups=2,rows=20000 124M ± 0%
MergeFiles/INT96/groups=2,rows=20000 30.1M ± 1%
MergeFiles/FLOAT/groups=2,rows=20000 153M ± 1%
MergeFiles/DOUBLE/groups=2,rows=20000 143M ± 1%
MergeFiles/BYTE_ARRAY/groups=2,rows=20000 81.4M ± 2%
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000 128M ± 1%
MergeFiles/STRING/groups=2,rows=20000 119M ± 0%
MergeFiles/STRING_(dict)/groups=2,rows=20000 134M ± 3%
MergeFiles/UUID/groups=2,rows=20000 59.0M ± 0%
MergeFiles/DECIMAL/groups=2,rows=20000 143M ± 0%
MergeFiles/AddressBook/groups=2,rows=20000 2.07M ± 0%
MergeFiles/one_optional_level/groups=2,rows=20000 92.6M ± 0%
MergeFiles/one_repeated_level/groups=2,rows=20000 7.16M ± 0%
MergeFiles/two_repeated_levels/groups=2,rows=20000 2.06M ± 3%
MergeFiles/three_repeated_levels/groups=2,rows=20000 2.05M ± 1%
MergeFiles/nested_lists/groups=2,rows=20000 1.32M ± 1%
MergeFiles/key-value_pairs/groups=2,rows=20000 3.40M ± 1%
MergeFiles/multiple_key-value_pairs/groups=2,rows=20000 1.15M ± 2%
MergeFiles/repeated_key-value_pairs/groups=2,rows=20000 1.39M ± 1%
MergeFiles/map_of_repeated_values/groups=2,rows=20000 1.66M ± 2%
name alloc/op
MergeFiles/BOOLEAN/groups=2,rows=20000 191B ± 0%
MergeFiles/INT32/groups=2,rows=20000 190B ± 0%
MergeFiles/INT64/groups=2,rows=20000 190B ± 0%
MergeFiles/INT96/groups=2,rows=20000 16.5kB ± 1%
MergeFiles/FLOAT/groups=2,rows=20000 190B ± 0%
MergeFiles/DOUBLE/groups=2,rows=20000 190B ± 0%
MergeFiles/BYTE_ARRAY/groups=2,rows=20000 192B ± 0%
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000 191B ± 0%
MergeFiles/STRING/groups=2,rows=20000 192B ± 0%
MergeFiles/STRING_(dict)/groups=2,rows=20000 4.18kB ± 0%
MergeFiles/UUID/groups=2,rows=20000 198B ± 0%
MergeFiles/DECIMAL/groups=2,rows=20000 190B ± 0%
MergeFiles/AddressBook/groups=2,rows=20000 1.72kB ± 2%
MergeFiles/one_optional_level/groups=2,rows=20000 197B ± 0%
MergeFiles/one_repeated_level/groups=2,rows=20000 75.3kB ± 0%
MergeFiles/two_repeated_levels/groups=2,rows=20000 62.3kB ± 0%
MergeFiles/three_repeated_levels/groups=2,rows=20000 62.3kB ± 0%
MergeFiles/nested_lists/groups=2,rows=20000 180kB ± 1%
MergeFiles/key-value_pairs/groups=2,rows=20000 186kB ± 0%
MergeFiles/multiple_key-value_pairs/groups=2,rows=20000 441kB ± 0%
MergeFiles/repeated_key-value_pairs/groups=2,rows=20000 158kB ± 1%
MergeFiles/map_of_repeated_values/groups=2,rows=20000 37.2kB ± 1%
name allocs/op
MergeFiles/BOOLEAN/groups=2,rows=20000 0.00
MergeFiles/INT32/groups=2,rows=20000 0.00
MergeFiles/INT64/groups=2,rows=20000 0.00
MergeFiles/INT96/groups=2,rows=20000 968 ± 0%
MergeFiles/FLOAT/groups=2,rows=20000 0.00
MergeFiles/DOUBLE/groups=2,rows=20000 0.00
MergeFiles/BYTE_ARRAY/groups=2,rows=20000 0.00
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=2,rows=20000 0.00
MergeFiles/STRING/groups=2,rows=20000 0.00
MergeFiles/STRING_(dict)/groups=2,rows=20000 1.00 ± 0%
MergeFiles/UUID/groups=2,rows=20000 1.00 ± 0%
MergeFiles/DECIMAL/groups=2,rows=20000 0.00
MergeFiles/AddressBook/groups=2,rows=20000 15.0 ± 0%
MergeFiles/one_optional_level/groups=2,rows=20000 1.00 ± 0%
MergeFiles/one_repeated_level/groups=2,rows=20000 41.0 ± 0%
MergeFiles/two_repeated_levels/groups=2,rows=20000 46.0 ± 0%
MergeFiles/three_repeated_levels/groups=2,rows=20000 46.0 ± 0%
MergeFiles/nested_lists/groups=2,rows=20000 90.0 ± 0%
MergeFiles/key-value_pairs/groups=2,rows=20000 83.0 ± 0%
MergeFiles/multiple_key-value_pairs/groups=2,rows=20000 230 ± 0%
MergeFiles/repeated_key-value_pairs/groups=2,rows=20000 73.0 ± 0%
MergeFiles/map_of_repeated_values/groups=2,rows=20000 26.0 ± 0%