GH-43891: [C++][Parquet] Faster reading of FIXED_LEN_BYTE_ARRAY data
Rationale for this change
Reading FIXED_LEN_BYTE_ARRAY columns goes through an intermediate array of FLBA structures, even when the end goal is to decode to a FixedSizeBinaryArray. This makes reading FLOAT16 data slower than FLOAT, even though the data is smaller in memory.
What changes are included in this PR?
Improve the performance of reading FIXED_LEN_BYTE_ARRAY columns to Arrow, by avoiding an intermediate read to FLBA structures. This especially helps improve the speed of reading FLOAT16 columns and makes it faster than FLOAT.
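The idea can be illustrated with a minimal sketch in Python. It is not the C++ code from this PR. It only assumes that PLAIN-encoded FIXED_LEN_BYTE_ARRAY values are stored back to back, so a fixed-width output buffer can be filled with one bulk copy instead of going through one record per value. The function names and the `(offset, length)` record layout are hypothetical and chosen for illustration only.

```python
import struct


def decode_via_intermediate(data: bytes, width: int, num_values: int) -> bytes:
    """Two-pass decode: build one record per value, then copy each value out."""
    # Pass 1: materialize an (offset, length) record per value, loosely
    # mimicking an intermediate array of small pointer-like structures.
    records = [(i * width, width) for i in range(num_values)]
    # Pass 2: copy every value into the output buffer individually.
    out = bytearray(num_values * width)
    for i, (offset, length) in enumerate(records):
        out[i * width:(i + 1) * width] = data[offset:offset + length]
    return bytes(out)


def decode_direct(data: bytes, width: int, num_values: int) -> bytes:
    """Single-pass decode: the fixed-width values are already contiguous."""
    nbytes = num_values * width
    if len(data) < nbytes:
        raise ValueError("not enough data for the requested number of values")
    return bytes(data[:nbytes])


# Four little-endian half-precision floats, 2 bytes each.
halves = struct.pack("<4e", 1.0, 2.5, -3.0, 0.5)
direct = decode_direct(halves, width=2, num_values=4)
two_pass = decode_via_intermediate(halves, width=2, num_values=4)
```

Both functions produce the same bytes; the direct version simply skips the per-value bookkeeping, which is the kind of overhead the change above is meant to remove.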
A couple of additional changes:
- simplify the inheritance structure of the decoders (less virtual multiple inheritance)
- convert some decoder methods from `VisitNullBitmapInline` to `VisitSetBitRuns`, which can be faster because the visitor operates on multiple values or nulls
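To show why visiting runs can be cheaper than visiting individual slots, here is a small Python sketch. It is not Arrow's implementation: `set_bit_runs`, `spaced_per_slot` and `spaced_per_run` are hypothetical helpers, the validity bitmap is modeled as a list of booleans, and null slots are zero-filled purely for the sake of the example.

```python
from itertools import groupby
from typing import Iterator, List, Tuple


def set_bit_runs(valid: List[bool]) -> Iterator[Tuple[int, int]]:
    """Yield (start, length) for each maximal run of valid (set) slots."""
    pos = 0
    for is_valid, group in groupby(valid):
        length = sum(1 for _ in group)
        if is_valid:
            yield pos, length
        pos += length


def spaced_per_slot(packed: bytes, valid: List[bool], width: int) -> Tuple[bytes, int]:
    """Expand packed non-null values, invoking the visitor once per slot."""
    out = bytearray(len(valid) * width)
    src = 0
    calls = 0
    for i, is_valid in enumerate(valid):
        calls += 1
        if is_valid:
            out[i * width:(i + 1) * width] = packed[src:src + width]
            src += width
    return bytes(out), calls


def spaced_per_run(packed: bytes, valid: List[bool], width: int) -> Tuple[bytes, int]:
    """Expand packed non-null values, invoking the visitor once per valid run."""
    out = bytearray(len(valid) * width)
    src = 0
    calls = 0
    for start, length in set_bit_runs(valid):
        calls += 1
        nbytes = length * width
        # One bulk copy covers every value in the run.
        out[start * width:start * width + nbytes] = packed[src:src + nbytes]
        src += nbytes
    return bytes(out), calls


valid = [True, True, True, False, False, True, True, False]
packed = bytes(range(10))  # five non-null values of width 2
slot_out, slot_calls = spaced_per_slot(packed, valid, width=2)
run_out, run_calls = spaced_per_run(packed, valid, width=2)
```

In this example the per-slot variant makes eight visitor calls while the run-based variant makes two, with identical output. The benefit shrinks when valid and null slots alternate frequently, since runs then become short.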
GH-43891 reproducer
- on git main:
$ taskset -c 1 python ../issue_43891.py
.
writing parquet file:/tmp/my.parquet, columns=7000, row_groups=1, rows=64000, compression=None, dtype=float
Parquet size=1.8 GB
finished writing parquet file in 2.58 seconds
`ParquetReader.read_row_groups`, dtype:float, duration:0.93 seconds
.
writing parquet file:/tmp/my.parquet, columns=7000, row_groups=1, rows=64000, compression=None, dtype=halffloat
Parquet size=896.9 MB
finished writing parquet file in 3.50 seconds
`ParquetReader.read_row_groups`, dtype:halffloat, duration:1.93 seconds
- on this PR:
$ taskset -c 1 python ../issue_43891.py
.
writing parquet file:/tmp/my.parquet, columns=7000, row_groups=1, rows=64000, compression=None, dtype=float
Parquet size=1.8 GB
finished writing parquet file in 2.60 seconds
`ParquetReader.read_row_groups`, dtype:float, duration:0.93 seconds
.
writing parquet file:/tmp/my.parquet, columns=7000, row_groups=1, rows=64000, compression=None, dtype=halffloat
Parquet size=896.9 MB
finished writing parquet file in 3.56 seconds
`ParquetReader.read_row_groups`, dtype:halffloat, duration:0.68 seconds
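As a rough sanity check on these numbers, the following Python snippet recomputes the raw data sizes and the speed-ups from the figures printed above. It assumes the reported sizes use decimal units and that the small difference from the raw size is file overhead; both are assumptions, not something the reproducer output states.

```python
columns, rows = 7000, 64000
bytes_per_value = {"float": 4, "halffloat": 2}

# Raw value bytes, ignoring any Parquet metadata or page headers.
raw_bytes = {name: columns * rows * size for name, size in bytes_per_value.items()}

# Read durations in seconds, copied from the output above.
halffloat_main, halffloat_pr, float_pr = 1.93, 0.68, 0.93

speedup = halffloat_main / halffloat_pr      # halffloat: git main vs this PR
vs_float = float_pr / halffloat_pr           # this PR: float vs halffloat
values_per_sec = columns * rows / halffloat_pr

print(f"raw sizes: {raw_bytes}")
print(f"halffloat speed-up over main: {speedup:.2f}x")
print(f"halffloat vs float on this PR: {vs_float:.2f}x faster")
print(f"halffloat values decoded per second: {values_per_sec:.3e}")
```

The raw sizes come out at 1,792,000,000 and 896,000,000 bytes, consistent with the reported 1.8 GB and 896.9 MB, and the halffloat read is about 2.84x faster than on git main.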
Float16 micro-benchmarks
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (12)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
benchmark baseline contender change % counters
BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1 562.162 MiB/sec 3.355 GiB/sec 511.183 {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 19}
BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1 578.746 MiB/sec 3.454 GiB/sec 511.120 {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 20}
BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100 394.509 MiB/sec 1.393 GiB/sec 261.589 {'family_index': 13, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14}
BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99 294.532 MiB/sec 1.033 GiB/sec 259.291 {'family_index': 11, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11}
BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100 398.375 MiB/sec 1.392 GiB/sec 257.827 {'family_index': 11, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13}
BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0 433.677 MiB/sec 1.352 GiB/sec 219.279 {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15}
BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0 456.924 MiB/sec 1.337 GiB/sec 199.528 {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16}
BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99 353.429 MiB/sec 1.027 GiB/sec 197.437 {'family_index': 13, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 12}
BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50 180.961 MiB/sec 523.042 MiB/sec 189.037 {'family_index': 11, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7}
BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50 187.537 MiB/sec 515.843 MiB/sec 175.061 {'family_index': 13, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6}
BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1 372.658 MiB/sec 978.543 MiB/sec 162.585 {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 12}
BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1 380.629 MiB/sec 965.615 MiB/sec 153.689 {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13}
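The `change %` column appears to be the relative throughput change, `(contender - baseline) / baseline * 100`. That formula is inferred from the rows above rather than taken from the benchmark tool's documentation, so treat the sketch below as an assumption. The helper names are hypothetical.

```python
UNIT_TO_MIB = {"MiB/sec": 1.0, "GiB/sec": 1024.0}


def to_mib_per_sec(text: str) -> float:
    """Parse a throughput such as '3.355 GiB/sec' into MiB/sec."""
    value, unit = text.split()
    return float(value) * UNIT_TO_MIB[unit]


def change_pct(baseline: str, contender: str) -> float:
    """Relative change of contender over baseline, in percent."""
    base = to_mib_per_sec(baseline)
    return (to_mib_per_sec(contender) - base) / base * 100.0


# Two rows from the table above.
plain_null50 = change_pct("180.961 MiB/sec", "523.042 MiB/sec")
plain_nonnull = change_pct("562.162 MiB/sec", "3.355 GiB/sec")
print(f"{plain_null50:.3f} {plain_nonnull:.3f}")
```

The recomputed values match the reported 189.037 and 511.183 to within the rounding of the displayed throughputs, so a value of about 511 means the contender is roughly 6.1 times as fast as the baseline.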
Binary/BinaryView micro-benchmarks
(changes < 10% omitted)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (72)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
benchmark baseline contender change % counters
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/4096 391.277 MiB/sec 492.255 MiB/sec 25.807 {'family_index': 19, 'per_family_instance_index': 1, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11865}
BM_ArrowBinaryPlain/DecodeArrow_Dense/4096 391.335 MiB/sec 491.410 MiB/sec 25.573 {'family_index': 18, 'per_family_instance_index': 1, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dense/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11825}
BM_ArrowBinaryPlain/DecodeArrow_Dense/1024 408.195 MiB/sec 510.719 MiB/sec 25.116 {'family_index': 18, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dense/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 48657}
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/1024 409.060 MiB/sec 511.452 MiB/sec 25.031 {'family_index': 19, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 48455}
BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1 1.299 GiB/sec 1.616 GiB/sec 24.373 {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1 1.337 GiB/sec 1.662 GiB/sec 24.334 {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1 1.202 GiB/sec 1.479 GiB/sec 23.090 {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1 1.215 GiB/sec 1.484 GiB/sec 22.148 {'family_index': 15, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
BM_ArrowBinaryViewPlain/DecodeArrow_Dense/65536 247.404 MiB/sec 297.437 MiB/sec 20.223 {'family_index': 22, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryViewPlain/DecodeArrow_Dense/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 468}
BM_ArrowBinaryPlain/DecodeArrow_Dense/32768 390.510 MiB/sec 469.172 MiB/sec 20.144 {'family_index': 18, 'per_family_instance_index': 2, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dense/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1448}
BM_ArrowBinaryPlain/DecodeArrow_Dense/65536 389.956 MiB/sec 466.511 MiB/sec 19.632 {'family_index': 18, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dense/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 721}
BM_ArrowBinaryViewPlain/DecodeArrow_Dense/1024 246.284 MiB/sec 293.331 MiB/sec 19.103 {'family_index': 22, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryViewPlain/DecodeArrow_Dense/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 28722}
BM_ArrowBinaryViewPlain/DecodeArrow_Dense/4096 246.550 MiB/sec 293.354 MiB/sec 18.984 {'family_index': 22, 'per_family_instance_index': 1, 'run_name': 'BM_ArrowBinaryViewPlain/DecodeArrow_Dense/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7442}
BM_ArrowBinaryViewPlain/DecodeArrowNonNull_Dense/32768 248.616 MiB/sec 295.759 MiB/sec 18.962 {'family_index': 23, 'per_family_instance_index': 2, 'run_name': 'BM_ArrowBinaryViewPlain/DecodeArrowNonNull_Dense/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 916}
BM_ArrowBinaryViewPlain/DecodeArrowNonNull_Dense/65536 248.044 MiB/sec 294.672 MiB/sec 18.798 {'family_index': 23, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryViewPlain/DecodeArrowNonNull_Dense/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 457}
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/65536 387.205 MiB/sec 459.602 MiB/sec 18.697 {'family_index': 19, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 726}
BM_ArrowBinaryViewPlain/DecodeArrow_Dense/32768 251.566 MiB/sec 297.910 MiB/sec 18.422 {'family_index': 22, 'per_family_instance_index': 2, 'run_name': 'BM_ArrowBinaryViewPlain/DecodeArrow_Dense/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 937}
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/32768 390.726 MiB/sec 461.577 MiB/sec 18.133 {'family_index': 19, 'per_family_instance_index': 2, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dense/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1441}
BM_ArrowBinaryViewPlain/DecodeArrowNonNull_Dense/4096 246.411 MiB/sec 289.293 MiB/sec 17.402 {'family_index': 23, 'per_family_instance_index': 1, 'run_name': 'BM_ArrowBinaryViewPlain/DecodeArrowNonNull_Dense/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7502}
BM_ArrowBinaryViewPlain/DecodeArrowNonNull_Dense/1024 248.673 MiB/sec 291.893 MiB/sec 17.380 {'family_index': 23, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryViewPlain/DecodeArrowNonNull_Dense/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 29796}
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/32768 150.793 MiB/sec 176.646 MiB/sec 17.145 {'family_index': 21, 'per_family_instance_index': 2, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 578}
BM_ArrowBinaryPlain/DecodeArrow_Dict/1024 160.740 MiB/sec 187.337 MiB/sec 16.547 {'family_index': 20, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dict/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 19403}
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/4096 160.799 MiB/sec 186.265 MiB/sec 15.837 {'family_index': 21, 'per_family_instance_index': 1, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4949}
BM_ArrowBinaryPlain/DecodeArrow_Dict/32768 153.344 MiB/sec 177.431 MiB/sec 15.708 {'family_index': 20, 'per_family_instance_index': 2, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dict/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 565}
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/65536 144.941 MiB/sec 166.843 MiB/sec 15.111 {'family_index': 21, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 274}
BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024 163.270 MiB/sec 187.429 MiB/sec 14.797 {'family_index': 21, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 19460}
BM_ArrowBinaryPlain/DecodeArrow_Dict/4096 163.489 MiB/sec 186.442 MiB/sec 14.040 {'family_index': 20, 'per_family_instance_index': 1, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dict/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4840}
BM_ReadBinaryColumnDeltaByteArray/null_probability:1/unique_values:-1 943.358 MiB/sec 1.046 GiB/sec 13.569 {'family_index': 16, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1 1.004 GiB/sec 1.132 GiB/sec 12.738 {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
BM_ReadBinaryColumn/null_probability:1/unique_values:-1 947.094 MiB/sec 1.042 GiB/sec 12.638 {'family_index': 14, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
BM_ArrowBinaryPlain/DecodeArrow_Dict/65536 153.064 MiB/sec 171.889 MiB/sec 12.298 {'family_index': 20, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dict/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 275}
BM_ReadBinaryColumn/null_probability:0/unique_values:-1 1.012 GiB/sec 1.135 GiB/sec 12.101 {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
Are these changes tested?
Yes.
Are there any user-facing changes?
No.
- GitHub Issue: #43891
@github-actions crossbow submit -g cpp
Revision: 1a13443b29fda5fe2f7f27715faff23a7aabc4ab
Submitted crossbow builds: ursacomputing/crossbow @ actions-53947102fd
@github-actions crossbow submit -g cpp
Revision: 7b3aab52ce74fe7ea4ba4fe1a467118027300788
Submitted crossbow builds: ursacomputing/crossbow @ actions-272d7c4e76
cc @wgtmac @mapleFU
The general idea LGTM. It's a bit late in my timezone, so I will do a careful review round tomorrow.
I think I've addressed your comments, @mapleFU. Do you want to take another look?
@github-actions crossbow submit -g cpp
Revision: 6f4370789c52686e1a33f1ea7f5f7943ae9127be
Submitted crossbow builds: ursacomputing/crossbow @ actions-5d23cc8567
@wgtmac Do you want to take a look at this? Otherwise I'll merge.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 587b3fcad5ae4524411714c80a1479ff9a435508.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details.