Multi-column null sanitization for struct columns
Description
Partially addresses https://github.com/rapidsai/cudf/issues/18350
This PR builds on the performance improvements in https://github.com/rapidsai/cudf/pull/19014. Instead of sequentially superimposing the top-level struct column null mask over the descendant columns, this PR introduces a new kernel that parallelizes the operation at two levels: (i) over all struct columns in the input table, and (ii) within each struct column, over all the root-to-leaf paths in the column hierarchy.
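To make the parallelized operation concrete, here is a hedged CPU-side sketch (not the actual CUDA kernel, and `superimpose_path` is a hypothetical name) of the core step along one root-to-leaf path: a descendant element can only be valid if every ancestor on its path is valid, which reduces to a bitwise AND of the null-mask words along that path. The new kernel performs these ANDs for all struct columns, and all root-to-leaf paths of each column, in a single launch.

```cpp
#include <cstdint>
#include <vector>

// CPU sketch of superimposing null masks along one root-to-leaf path.
// path_masks holds the null-mask words of each column on the path,
// ordered root -> leaf; all masks are assumed to be the same length.
std::vector<uint32_t> superimpose_path(std::vector<std::vector<uint32_t>> const& path_masks)
{
  // Start from all-valid and AND in each ancestor's mask word-by-word.
  std::vector<uint32_t> out(path_masks.front().size(), ~uint32_t{0});
  for (auto const& mask : path_masks) {
    for (std::size_t w = 0; w < out.size(); ++w) {
      out[w] &= mask[w];
    }
  }
  return out;
}
```

In the GPU version described above, each (struct column, root-to-leaf path) pair maps to its own parallel unit of work, so no column or path is processed sequentially.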
Checklist
- [X] I am familiar with the Contributing Guidelines.
- [X] New or existing tests cover these changes.
- [X] The documentation is up to date with these changes.
Profiles
Input table: Table with ~306k rows and 64 struct columns
./PARQUET_READER_NVBENCH --benchmark parquet_read_decode -a cardinality=0 -a run_length=1 --axis data_type=STRUCT (https://github.com/rapidsai/cudf/pull/19014)
~24 ms spent in superimpose + sanitize nulls. Note that the operation is performed for each column separately. Moreover, the superimpose step is sequential: as we traverse the descendants of the struct column, the null masks are superimposed iteratively for each descendant node.
./PARQUET_READER_NVBENCH --benchmark parquet_read_decode -a cardinality=0 -a run_length=1 --axis data_type=STRUCT (current PR)
~8 ms spent in superimpose + sanitize with the multi-column optimization! The 3x speedup comes primarily from kernel fusion: the single multi-column, multi-path (for within-column superimpose) kernel performs all the necessary superimpose operations in parallel in one launch.
/ok to test 19c37f2
Apparently your commits are signed. I wonder if you already are a `cpp-codeowner`?

There's one that's not verified :(
Benchmark Results
~33% increase in throughput
parquet_read_decode (current PR)
[0] NVIDIA A100 80GB PCIe
| data_type | io_type | cardinality | run_length | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STRUCT | DEVICE_BUFFER | 0 | 1 | 536870912 | 384x | 38.566 ms | 1.55% | 38.557 ms | 1.55% | 13924220177 | 1.392 GiB | 499.298 MiB |
parquet_read_decode (pre-https://github.com/rapidsai/cudf/pull/19014)
[0] NVIDIA A100 80GB PCIe
| data_type | io_type | cardinality | run_length | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STRUCT | DEVICE_BUFFER | 0 | 1 | 536870912 | 9x | 56.910 ms | 0.19% | 56.901 ms | 0.19% | 9435111923 | 1.380 GiB | 499.298 MiB |
/ok to test 19c37f2
> Apparently your commits are signed. I wonder if you already are a `cpp-codeowner`? There's one that's not verified :(
Yes, my gitconfig is not enforcing signing on the merge commits somehow. I need to fix this :(
That happens when you update your branch outside of Cursor/VS Code and then forget to pull before committing anything to your branch in VS Code/Cursor. That usually leads to an unsigned commit plus an automatic merge commit when you push.
Would love to see a benchmark for `segmented_offset_bitmask_binop` sweeping over the `bitmasks_per_segment` and `destination_size`/`source_size_bits` axes. It might be worth using a block instead of a warp per segment at some point.
Thank you for the suggestion - added a benchmark for this.
Benchmark Results
segmented_bitmask_and
[0] NVIDIA A100 80GB PCIe
| num_segments | expected_masks_per_segment | mask_size_bits | input size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Mbytes_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 4 | 32 | 3192 | 7456x | 74.779 us | 96.05% | 67.095 us | 107.04% | 47.574M | 47.574 MB/s | 0.00% | 45 |
| 1000 | 4 | 32 | 31712 | 2704x | 304.538 us | 32.52% | 297.200 us | 33.33% | 106.703M | 106.703 MB/s | 0.01% | 101 |
| 10000 | 4 | 32 | 319736 | 1708x | 2.555 ms | 4.64% | 2.547 ms | 4.66% | 125.511M | 125.511 MB/s | 0.01% | 119 |
| 100 | 8 | 32 | 6392 | 7136x | 77.484 us | 6.25% | 70.161 us | 6.92% | 91.104M | 91.104 MB/s | 0.00% | 86 |
| 1000 | 8 | 32 | 63712 | 1648x | 313.509 us | 1.89% | 306.148 us | 1.94% | 208.109M | 208.109 MB/s | 0.01% | 198 |
| 10000 | 8 | 32 | 639736 | 1680x | 2.640 ms | 4.04% | 2.633 ms | 4.05% | 243.003M | 243.003 MB/s | 0.01% | 231 |
| 100 | 16 | 32 | 12792 | 6992x | 79.079 us | 5.52% | 71.540 us | 5.99% | 178.809M | 178.809 MB/s | 0.01% | 170 |
| 1000 | 16 | 32 | 127712 | 2368x | 340.510 us | 6.43% | 332.978 us | 6.56% | 383.545M | 383.545 MB/s | 0.02% | 365 |
| 10000 | 16 | 32 | 1279736 | 288x | 2.763 ms | 1.24% | 2.756 ms | 1.25% | 464.366M | 464.366 MB/s | 0.02% | 442 |
| 100 | 4 | 64 | 4588 | 8144x | 68.759 us | 2.79% | 61.446 us | 3.00% | 74.667M | 74.667 MB/s | 0.00% | 71 |
| 1000 | 4 | 64 | 45568 | 2448x | 303.213 us | 4.18% | 295.836 us | 4.27% | 154.031M | 154.031 MB/s | 0.01% | 146 |
| 10000 | 4 | 64 | 459604 | 1693x | 2.568 ms | 3.54% | 2.561 ms | 3.55% | 179.481M | 179.481 MB/s | 0.01% | 171 |
| 100 | 8 | 64 | 9388 | 7104x | 78.043 us | 7.20% | 70.494 us | 7.94% | 133.174M | 133.174 MB/s | 0.01% | 127 |
| 1000 | 8 | 64 | 93568 | 1648x | 313.579 us | 1.69% | 306.027 us | 1.73% | 305.751M | 305.751 MB/s | 0.02% | 291 |
| 10000 | 8 | 64 | 939604 | 1024x | 2.600 ms | 1.34% | 2.592 ms | 1.34% | 362.498M | 362.498 MB/s | 0.02% | 345 |
| 100 | 16 | 64 | 18988 | 7120x | 77.703 us | 5.05% | 70.293 us | 5.51% | 270.126M | 270.126 MB/s | 0.01% | 257 |
| 1000 | 16 | 64 | 189568 | 3184x | 338.151 us | 10.95% | 330.565 us | 11.19% | 573.467M | 573.467 MB/s | 0.03% | 546 |
| 10000 | 16 | 64 | 1899604 | 512x | 2.766 ms | 1.27% | 2.758 ms | 1.27% | 688.756M | 688.756 MB/s | 0.04% | 656 |
| 100 | 4 | 128 | 7380 | 8128x | 69.045 us | 2.52% | 61.533 us | 2.77% | 119.935M | 119.935 MB/s | 0.01% | 114 |
| 1000 | 4 | 128 | 73280 | 3168x | 302.243 us | 7.10% | 294.720 us | 7.27% | 248.643M | 248.643 MB/s | 0.01% | 237 |
| 10000 | 4 | 128 | 739340 | 1056x | 2.549 ms | 1.18% | 2.541 ms | 1.19% | 290.920M | 290.920 MB/s | 0.02% | 277 |
| 100 | 8 | 128 | 15380 | 7072x | 78.109 us | 6.49% | 70.744 us | 7.07% | 217.403M | 217.403 MB/s | 0.01% | 207 |
| 1000 | 8 | 128 | 153280 | 2240x | 314.160 us | 2.59% | 306.666 us | 2.44% | 499.828M | 499.828 MB/s | 0.03% | 476 |
| 10000 | 8 | 128 | 1539340 | 1200x | 2.632 ms | 5.31% | 2.624 ms | 5.32% | 586.573M | 586.573 MB/s | 0.03% | 559 |
| 100 | 16 | 128 | 31380 | 7008x | 78.849 us | 1.79% | 71.510 us | 1.94% | 438.819M | 438.819 MB/s | 0.02% | 418 |
| 1000 | 16 | 128 | 313280 | 1520x | 338.739 us | 1.35% | 331.401 us | 1.38% | 945.321M | 945.321 MB/s | 0.05% | 901 |
| 10000 | 16 | 128 | 3139340 | 784x | 2.765 ms | 1.31% | 2.758 ms | 1.32% | 1.138G | 1.138 GB/s | 0.06% | 1085 |
/ok to test 49f26de
Thank you for adding this. Can we also run this on `branch-25.08` and compare performance? I'd recommend removing redundant columns like `Samples` and `Noise`.
Also maybe a dumb question, what is the `BWUtil` column and why is it always 0%?
Maybe another dumb question, is `expected_masks_per_segment` the average number of masks per segment? If so, can we just name it `avg_masks_per_segment` so the column is not that wide.
Edit: Not intending to increase work for you (😄) but if you compare the performance against `branch-25.08`, it would be nice if you could combine the results into one table, something like this:
> Thank you for adding this. Can we also run this on `branch-25.08` and compare performance. Recommended to remove redundant columns like `Samples`, `Noise`.
We cannot directly compare with main since a multi-segment approach did not exist for bitwise bitmask operations. We can, however, put together a baseline implementation using the `bitmask_binop` in 25.08 and process segments sequentially in a loop. I have introduced another benchmark, `multi_segment_bitwise_and`, which implements this approach.
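For illustration, here is a hedged host-side sketch of that baseline strategy (the name `segmented_and_baseline` and the flat one-vector-per-segment layout are hypothetical, not cudf's API): each segment is reduced with an independent AND pass, in a sequential loop over segments, whereas the segmented kernel in this PR handles all segments in a single launch.

```cpp
#include <cstdint>
#include <vector>

// Baseline sketch: one bitmask_binop-style AND reduction per segment,
// looping over segments sequentially. `segments` holds one vector of
// 32-bit mask words per segment; the result holds one reduced word per
// segment (real bitmasks span many words, this keeps the sketch short).
std::vector<uint32_t> segmented_and_baseline(
  std::vector<std::vector<uint32_t>> const& segments)
{
  std::vector<uint32_t> out;
  out.reserve(segments.size());
  for (auto const& seg : segments) {  // sequential loop over segments
    uint32_t acc = ~uint32_t{0};      // identity for bitwise AND
    for (auto word : seg) { acc &= word; }
    out.push_back(acc);
  }
  return out;
}
```

The sequential outer loop is exactly what the multi-segment kernel fuses away, which is why the gap widens as `num_segments` grows in the comparison table below.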
> Also maybe a dumb question, what is the `BWUtil` column and why is it always 0%? Maybe another dumb question, is `expected_masks_per_segment` the average number of masks per segment? If so, can we just name it `avg_masks_per_segment` just so the column is not that wide.
`BWUtil` is the achieved global memory throughput as a percentage of peak global memory bandwidth - it is very low for this kernel since our input sizes are still very small. I'm using `expected_masks_per_segment` since I'm treating this axis parameter as the mean of the Gaussian distribution of segment sizes.
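As a quick sanity check on why the column reads 0%: the ratio is just achieved bytes-per-second over peak bytes-per-second. A minimal sketch (the ~1935 GB/s A100 80GB PCIe peak is an illustrative spec figure, and `bw_util` is a made-up helper name):

```cpp
// Fraction of peak global-memory bandwidth actually achieved.
// bytes_moved / seconds gives achieved throughput; dividing by the
// device peak yields the utilization fraction NVBench reports as BWUtil.
double bw_util(double bytes_moved, double seconds, double peak_bytes_per_sec)
{
  return (bytes_moved / seconds) / peak_bytes_per_sec;
}
```

Plugging in a throughput on the order of tens of MB/s against a peak on the order of a couple of TB/s yields a fraction well below 0.01%, which rounds to the 0.00% shown in the tables.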
> Edit: Not intending to increase work for you (😄) but if you compare the performance against `branch-25.08`, it would be nice if you could just combine results into one table something like this
| num_segments | expected_masks_per_segment | mask_size_bits | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 4 | 32 | 3.117 ms | 8.47% | 123.009 us | 53.29% | -2994.196 us | -96.05% | FAST |
| 1000 | 4 | 32 | 30.138 ms | 3.80% | 876.038 us | 1.25% | -29261.583 us | -97.09% | FAST |
| 10000 | 4 | 32 | 300.686 ms | 0.77% | 8.420 ms | 4.52% | -292266.573 us | -97.20% | FAST |
| 100 | 8 | 32 | 3.592 ms | 5.48% | 129.307 us | 1.59% | -3462.461 us | -96.40% | FAST |
| 1000 | 8 | 32 | 35.980 ms | 0.17% | 885.215 us | 0.96% | -35094.974 us | -97.54% | FAST |
| 10000 | 8 | 32 | 359.420 ms | 0.50% | 8.492 ms | 2.98% | -350928.364 us | -97.64% | FAST |
| 100 | 16 | 32 | 3.663 ms | 0.41% | 130.078 us | 1.47% | -3532.741 us | -96.45% | FAST |
| 1000 | 16 | 32 | 36.740 ms | 0.14% | 913.124 us | 1.17% | -35826.887 us | -97.51% | FAST |
| 10000 | 16 | 32 | 363.865 ms | 0.33% | 8.605 ms | 2.78% | -355260.631 us | -97.64% | FAST |
| 100 | 4 | 64 | 3.005 ms | 1.01% | 120.781 us | 2.20% | -2884.040 us | -95.98% | FAST |
| 1000 | 4 | 64 | 29.456 ms | 0.29% | 879.138 us | 1.25% | -28577.188 us | -97.02% | FAST |
| 10000 | 4 | 64 | 296.477 ms | 0.08% | 8.345 ms | 1.31% | -288131.614 us | -97.19% | FAST |
| 100 | 8 | 64 | 3.539 ms | 5.05% | 129.508 us | 2.05% | -3409.795 us | -96.34% | FAST |
| 1000 | 8 | 64 | 35.390 ms | 0.33% | 884.424 us | 1.00% | -34505.961 us | -97.50% | FAST |
| 10000 | 8 | 64 | 357.183 ms | 0.08% | 8.521 ms | 0.92% | -348662.507 us | -97.61% | FAST |
| 100 | 16 | 64 | 3.639 ms | 0.44% | 131.232 us | 15.66% | -3508.043 us | -96.39% | FAST |
| 1000 | 16 | 64 | 36.358 ms | 0.50% | 916.605 us | 0.76% | -35441.242 us | -97.48% | FAST |
| 10000 | 16 | 64 | 361.185 ms | 0.23% | 8.657 ms | 2.77% | -352527.788 us | -97.60% | FAST |
| 100 | 4 | 128 | 2.999 ms | 1.06% | 121.859 us | 2.53% | -2876.852 us | -95.94% | FAST |
| 1000 | 4 | 128 | 29.471 ms | 1.54% | 876.995 us | 1.02% | -28593.991 us | -97.02% | FAST |
| 10000 | 4 | 128 | 296.791 ms | 0.09% | 8.421 ms | 1.18% | -288369.912 us | -97.16% | FAST |
| 100 | 8 | 128 | 3.531 ms | 0.63% | 130.801 us | 1.51% | -3399.740 us | -96.30% | FAST |
| 1000 | 8 | 128 | 35.327 ms | 0.21% | 882.967 us | 2.87% | -34443.824 us | -97.50% | FAST |
| 10000 | 8 | 128 | 358.118 ms | 0.22% | 8.450 ms | 1.12% | -349668.020 us | -97.64% | FAST |
| 100 | 16 | 128 | 3.644 ms | 0.38% | 130.875 us | 1.41% | -3512.854 us | -96.41% | FAST |
| 1000 | 16 | 128 | 36.375 ms | 0.14% | 918.422 us | 0.76% | -35456.856 us | -97.48% | FAST |
| 10000 | 16 | 128 | 361.774 ms | 0.26% | 8.634 ms | 0.75% | -353140.489 us | -97.61% | FAST |
/ok to test 7d64717
/ok to test 5b3b51e
/merge