Parquet reader list microkernel
This PR refactors the fixed-width Parquet list reader decoding into its own set of micro-kernels, templatizing the existing fixed-width micro-kernels. When skipping rows for lists, we now also skip ahead in the decoding of the definition, repetition, and dictionary rle_streams. The list kernel uses 128 threads per block and 71 registers per thread, so I've changed the launch_bounds to enforce a minimum of 8 blocks per SM. This causes a small register spill, but the benchmarks are still faster, as seen below:
DEVICE_BUFFER list benchmarks (decompress + decode, not bound by I/O):
- run_length 1, cardinality 0, no byte_limit: 24.7% faster
- run_length 32, cardinality 1000, no byte_limit: 18.3% faster
- run_length 1, cardinality 0, 500 KB byte_limit: 57% faster
- run_length 32, cardinality 1000, 500 KB byte_limit: 53% faster
Hard-drive benchmarks:
- Compressed list of ints: 5.5% faster
- Sample real data (many non-list columns): 0.5% faster
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Seems like this is also adding list support to the split page path. Am I reading this right?
One thing I've been thinking about is maybe splitting this file into two or three pieces.
- One `.cu` file containing the core loops for each of the major kernels (and the host-side launch code)
- A `.cuh` file for the "update" functions
- A `.cuh` file for the "decode values" functions
Definitely not for this PR, but something to think about down the road. I think it might help make the volume of code that has built up here more tractable.
> Seems like this is also adding list support to the split page path as well. Am I reading this right?
Yes.
Please also run `compute-sanitizer` on the unit tests to make sure everything is good.
Tests pass.
/merge