Parquet reader list microkernel
This PR refactors the fixed-width Parquet list reader decoding into its own set of micro-kernels, templatizing the existing fixed-width micro-kernels. When skipping rows for lists, we now also skip ahead in the decoding of the definition, repetition, and dictionary rle_streams. The list kernel uses 128 threads per block and 71 registers per thread, so I've changed the launch_bounds to enforce a minimum of 8 blocks per SM. This causes a small register spill, but the benchmarks are still faster, as seen below:
DEVICE_BUFFER list benchmarks (decompress + decode, not bound by I/O):
- run_length 1, cardinality 0, no byte_limit: 24.7% faster
- run_length 32, cardinality 1000, no byte_limit: 18.3% faster
- run_length 1, cardinality 0, 500 KB byte_limit: 57% faster
- run_length 32, cardinality 1000, 500 KB byte_limit: 53% faster
Hard-drive benchmarks:
- Compressed list of ints: 5.5% faster
- Sample real data (many non-list columns): 0.5% faster
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Seems like this is also adding list support to the split page path. Am I reading this right?
One thing I've been thinking about is maybe splitting this file into two or three pieces.
- One `.cu` file containing the core loops for each of the major kernels (and the host-side launch code)
- A `.cuh` file for the "update" functions
- A `.cuh` file for the "decode values" functions
Definitely not for this PR, but something to think about down the road. I think it might help make the volume of code that has built up here more tractable.
> Seems like this is also adding list support to the split page path as well. Am I reading this right?
Yes.
Please also run `compute-sanitizer` on the unit tests to make sure everything is good.
Tests pass.
/merge