
Multi-column null sanitization for struct columns

Open · shrshi opened this pull request 8 months ago · 8 comments

Description

Partially addresses https://github.com/rapidsai/cudf/issues/18350

This PR builds on the performance improvements in https://github.com/rapidsai/cudf/pull/19014. Instead of sequentially superimposing the top-level struct column's null mask over its descendant columns, this PR introduces a new kernel that parallelizes the operation at two levels: (i) over all struct columns in the input table, and (ii) within each struct column, over all root-to-leaf paths in the column hierarchy.
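A rough host-side sketch of this two-level decomposition (the `node` type and helper names here are hypothetical, not the PR's actual API): every (struct column, root-to-leaf path) pair becomes one independent work item, so a single launch can cover the whole table. Each path performs its own prefix-AND down the hierarchy, so correctness does not depend on other paths; on the GPU, interior nodes shared by multiple paths would need `atomicAnd` updates.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of a struct column hierarchy: each node carries a
// null mask (one word per 32 rows) and its child columns.
struct node {
  std::vector<uint32_t> null_mask;
  std::vector<node> children;
};

// Collect every root-to-leaf path in one column hierarchy.
static void collect_paths(node* n, std::vector<node*>& path,
                          std::vector<std::vector<node*>>& out) {
  path.push_back(n);
  if (n->children.empty()) out.push_back(path);
  for (auto& c : n->children) collect_paths(&c, path, out);
  path.pop_back();
}

// Superimpose ancestor null masks onto descendants along every path of
// every column. Each entry of `work` is independent, so a GPU kernel can
// process all paths of all columns in a single launch (with atomicAnd
// where paths share interior nodes).
void superimpose_all(std::vector<node>& columns) {
  std::vector<std::vector<node*>> work;
  for (auto& col : columns) {
    std::vector<node*> path;
    collect_paths(&col, path, work);
  }
  for (auto& p : work) {  // each iteration is independent, parallel on GPU
    for (std::size_t i = 1; i < p.size(); ++i)
      for (std::size_t w = 0; w < p[i]->null_mask.size(); ++w)
        p[i]->null_mask[w] &= p[i - 1]->null_mask[w];
  }
}
```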

Checklist

  • [X] I am familiar with the Contributing Guidelines.
  • [X] New or existing tests cover these changes.
  • [X] The documentation is up to date with these changes.

shrshi avatar Jun 03 '25 23:06 shrshi

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Jun 03 '25 23:06 copy-pr-bot[bot]

Profiles

Input table: Table with ~306k rows and 64 struct columns

./PARQUET_READER_NVBENCH --benchmark parquet_read_decode -a cardinality=0 -a run_length=1 --axis data_type=STRUCT (https://github.com/rapidsai/cudf/pull/19014)

(Profiler screenshot.) ~24 ms spent in superimpose + sanitize nulls. Note that the operation is performed for each column separately. Moreover, the superimpose step is sequential: as we traverse the descendants of the struct column, the null masks are superimposed iteratively onto each descendant node.
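For contrast, that sequential traversal looks roughly like the following (a hypothetical model, not the actual cudf code): the parent's null mask is ANDed into each descendant one node at a time, so a deep hierarchy pays one pass, i.e. one kernel launch, per parent-child edge.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of a struct column hierarchy.
struct node {
  std::vector<uint32_t> null_mask;  // one word per 32 rows
  std::vector<node> children;
};

// Pre-PR behavior: AND the parent mask into each child, then recurse.
// Each edge of the hierarchy is handled by its own pass (launch), which
// is what the fused multi-path kernel in this PR replaces.
void superimpose_sequential(node& parent) {
  for (auto& child : parent.children) {
    for (std::size_t w = 0; w < child.null_mask.size(); ++w)
      child.null_mask[w] &= parent.null_mask[w];  // one launch per edge
    superimpose_sequential(child);
  }
}
```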

./PARQUET_READER_NVBENCH --benchmark parquet_read_decode -a cardinality=0 -a run_length=1 --axis data_type=STRUCT (current PR)

(Profiler screenshot.) ~8 ms spent in superimpose + sanitize with the multi-column optimization! The 3x speedup comes primarily from kernel fusion: the multi-column, multi-path superimpose kernel performs all the necessary superimpose operations in parallel in a single launch.

shrshi avatar Jun 11 '25 21:06 shrshi

/ok to test 19c37f2

shrshi avatar Jun 13 '25 01:06 shrshi

/ok to test 19c37f2

Apparently your commits are signed. I wonder if you already are cpp-codeowner?

ttnghia avatar Jun 13 '25 03:06 ttnghia

/ok to test 19c37f2

Apparently your commits are signed. I wonder if you already are cpp-codeowner?

There's one that's not verified :(

vuule avatar Jun 13 '25 19:06 vuule

Benchmark Results

~48% increase in throughput (~33% reduction in decode time)

parquet_read_decode (current PR)

[0] NVIDIA A100 80GB PCIe

| data_type | io_type | cardinality | run_length | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| STRUCT | DEVICE_BUFFER | 0 | 1 | 536870912 | 384x | 38.566 ms | 1.55% | 38.557 ms | 1.55% | 13924220177 | 1.392 GiB | 499.298 MiB |

parquet_read_decode (pre-https://github.com/rapidsai/cudf/pull/19014)

[0] NVIDIA A100 80GB PCIe

| data_type | io_type | cardinality | run_length | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| STRUCT | DEVICE_BUFFER | 0 | 1 | 536870912 | 9x | 56.910 ms | 0.19% | 56.901 ms | 0.19% | 9435111923 | 1.380 GiB | 499.298 MiB |

shrshi avatar Jun 13 '25 20:06 shrshi

/ok to test 19c37f2

Apparently your commits are signed. I wonder if you already are cpp-codeowner?

There's one that's not verified :(

Yes, my gitconfig is not enforcing signing on the merge commits somehow. I need to fix this :(

shrshi avatar Jun 13 '25 21:06 shrshi

Yes, my gitconfig is not enforcing signing on the merge commits somehow. I need to fix this :(

That happens when you update your branch outside of Cursor/VS Code and then forget to pull before committing anything to your branch in VS Code/Cursor. That usually leads to an unsigned commit plus an automatic merge commit when you push.
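For reference, the standard git setting for this is `commit.gpgsign`, which also covers merge commits that `git pull` creates automatically. A quick sanity check in a throwaway repo (paths here are illustrative):

```shell
# Enable signing for every commit git creates, then read the setting back.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/demo"
cd "$tmp/demo"
git config commit.gpgsign true     # add --global to apply to all repos
git config --get commit.gpgsign    # prints: true
```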

mhaseeb123 avatar Jun 13 '25 21:06 mhaseeb123

Would love to see a benchmark for `segmented_offset_bitmask_binop` sweeping over the `bitmasks_per_segment` and `destination_size`/`source_size_bits` axes. It might be worth using a block instead of a warp per segment at some point.

Thank you for the suggestion - added a benchmark for this.

Benchmark Results

segmented_bitmask_and

[0] NVIDIA A100 80GB PCIe

| num_segments | expected_masks_per_segment | mask_size_bits | input size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Mbytes_per_second |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 100 | 4 | 32 | 3192 | 7456x | 74.779 us | 96.05% | 67.095 us | 107.04% | 47.574M | 47.574 MB/s | 0.00% | 45 |
| 1000 | 4 | 32 | 31712 | 2704x | 304.538 us | 32.52% | 297.200 us | 33.33% | 106.703M | 106.703 MB/s | 0.01% | 101 |
| 10000 | 4 | 32 | 319736 | 1708x | 2.555 ms | 4.64% | 2.547 ms | 4.66% | 125.511M | 125.511 MB/s | 0.01% | 119 |
| 100 | 8 | 32 | 6392 | 7136x | 77.484 us | 6.25% | 70.161 us | 6.92% | 91.104M | 91.104 MB/s | 0.00% | 86 |
| 1000 | 8 | 32 | 63712 | 1648x | 313.509 us | 1.89% | 306.148 us | 1.94% | 208.109M | 208.109 MB/s | 0.01% | 198 |
| 10000 | 8 | 32 | 639736 | 1680x | 2.640 ms | 4.04% | 2.633 ms | 4.05% | 243.003M | 243.003 MB/s | 0.01% | 231 |
| 100 | 16 | 32 | 12792 | 6992x | 79.079 us | 5.52% | 71.540 us | 5.99% | 178.809M | 178.809 MB/s | 0.01% | 170 |
| 1000 | 16 | 32 | 127712 | 2368x | 340.510 us | 6.43% | 332.978 us | 6.56% | 383.545M | 383.545 MB/s | 0.02% | 365 |
| 10000 | 16 | 32 | 1279736 | 288x | 2.763 ms | 1.24% | 2.756 ms | 1.25% | 464.366M | 464.366 MB/s | 0.02% | 442 |
| 100 | 4 | 64 | 4588 | 8144x | 68.759 us | 2.79% | 61.446 us | 3.00% | 74.667M | 74.667 MB/s | 0.00% | 71 |
| 1000 | 4 | 64 | 45568 | 2448x | 303.213 us | 4.18% | 295.836 us | 4.27% | 154.031M | 154.031 MB/s | 0.01% | 146 |
| 10000 | 4 | 64 | 459604 | 1693x | 2.568 ms | 3.54% | 2.561 ms | 3.55% | 179.481M | 179.481 MB/s | 0.01% | 171 |
| 100 | 8 | 64 | 9388 | 7104x | 78.043 us | 7.20% | 70.494 us | 7.94% | 133.174M | 133.174 MB/s | 0.01% | 127 |
| 1000 | 8 | 64 | 93568 | 1648x | 313.579 us | 1.69% | 306.027 us | 1.73% | 305.751M | 305.751 MB/s | 0.02% | 291 |
| 10000 | 8 | 64 | 939604 | 1024x | 2.600 ms | 1.34% | 2.592 ms | 1.34% | 362.498M | 362.498 MB/s | 0.02% | 345 |
| 100 | 16 | 64 | 18988 | 7120x | 77.703 us | 5.05% | 70.293 us | 5.51% | 270.126M | 270.126 MB/s | 0.01% | 257 |
| 1000 | 16 | 64 | 189568 | 3184x | 338.151 us | 10.95% | 330.565 us | 11.19% | 573.467M | 573.467 MB/s | 0.03% | 546 |
| 10000 | 16 | 64 | 1899604 | 512x | 2.766 ms | 1.27% | 2.758 ms | 1.27% | 688.756M | 688.756 MB/s | 0.04% | 656 |
| 100 | 4 | 128 | 7380 | 8128x | 69.045 us | 2.52% | 61.533 us | 2.77% | 119.935M | 119.935 MB/s | 0.01% | 114 |
| 1000 | 4 | 128 | 73280 | 3168x | 302.243 us | 7.10% | 294.720 us | 7.27% | 248.643M | 248.643 MB/s | 0.01% | 237 |
| 10000 | 4 | 128 | 739340 | 1056x | 2.549 ms | 1.18% | 2.541 ms | 1.19% | 290.920M | 290.920 MB/s | 0.02% | 277 |
| 100 | 8 | 128 | 15380 | 7072x | 78.109 us | 6.49% | 70.744 us | 7.07% | 217.403M | 217.403 MB/s | 0.01% | 207 |
| 1000 | 8 | 128 | 153280 | 2240x | 314.160 us | 2.59% | 306.666 us | 2.44% | 499.828M | 499.828 MB/s | 0.03% | 476 |
| 10000 | 8 | 128 | 1539340 | 1200x | 2.632 ms | 5.31% | 2.624 ms | 5.32% | 586.573M | 586.573 MB/s | 0.03% | 559 |
| 100 | 16 | 128 | 31380 | 7008x | 78.849 us | 1.79% | 71.510 us | 1.94% | 438.819M | 438.819 MB/s | 0.02% | 418 |
| 1000 | 16 | 128 | 313280 | 1520x | 338.739 us | 1.35% | 331.401 us | 1.38% | 945.321M | 945.321 MB/s | 0.05% | 901 |
| 10000 | 16 | 128 | 3139340 | 784x | 2.765 ms | 1.31% | 2.758 ms | 1.32% | 1.138G | 1.138 GB/s | 0.06% | 1085 |

shrshi avatar Jun 25 '25 21:06 shrshi

/ok to test 49f26de

shrshi avatar Jun 25 '25 21:06 shrshi

Thank you for the suggestion - added a benchmark for this.

Thank you for adding this. Can we also run this on branch-25.08 and compare performance? I'd also recommend removing redundant columns like Samples and Noise.

Also, maybe a dumb question: what is the `BWUtil` column, and why is it always 0%? Another possibly dumb question: is `expected_masks_per_segment` the average number of masks per segment? If so, can we name it `avg_masks_per_segment` just so the column is not as wide?

Edit: Not intending to increase work for you (😄), but if you compare the performance against branch-25.08, it would be nice to combine the results into one table, something like this

mhaseeb123 avatar Jun 25 '25 22:06 mhaseeb123

Thank you for the suggestion - added a benchmark for this.

Thank you for adding this. Can we also run this on branch-25.08 and compare performance? I'd also recommend removing redundant columns like Samples and Noise.

We cannot directly compare with main since a multi-segment approach did not exist for bitwise operations. We can put together a baseline implementation using the `bitmask_binop` in 25.08 and process segments sequentially in a loop; I have introduced another benchmark, `multi_segment_bitwise_and`, which implements this approach.
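A host-side sketch of the two strategies being compared (hypothetical types and names; `and_segment` stands in for a single per-segment `bitmask_binop`-style call, and the fused loop models the flat index space a single multi-segment launch would cover):

```cpp
#include <cstdint>
#include <vector>

using mask = std::vector<uint32_t>;  // bitmask as 32-bit words

// AND all masks of one segment into a single output mask: the
// per-segment primitive the sequential baseline calls in a loop.
mask and_segment(std::vector<mask> const& masks) {
  mask out(masks.front().size(), ~0u);
  for (auto const& m : masks)
    for (std::size_t w = 0; w < m.size(); ++w) out[w] &= m[w];
  return out;
}

// Baseline (25.08-style): one call, i.e. one kernel launch, per segment.
std::vector<mask> and_all_sequential(std::vector<std::vector<mask>> const& segs) {
  std::vector<mask> out;
  for (auto const& s : segs) out.push_back(and_segment(s));
  return out;
}

// Fused approach: one pass over all (segment, word) pairs, which is what
// a single multi-segment kernel launch covers.
std::vector<mask> and_all_fused(std::vector<std::vector<mask>> const& segs) {
  std::vector<mask> out(segs.size());
  for (std::size_t s = 0; s < segs.size(); ++s) {  // parallel on the GPU
    out[s].assign(segs[s].front().size(), ~0u);
    for (auto const& m : segs[s])
      for (std::size_t w = 0; w < m.size(); ++w) out[s][w] &= m[w];
  }
  return out;
}
```

Both produce identical results; the difference is purely in launch count, which is what the benchmark comparison below measures.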

Also, maybe a dumb question: what is the `BWUtil` column, and why is it always 0%? Another possibly dumb question: is `expected_masks_per_segment` the average number of masks per segment? If so, can we name it `avg_masks_per_segment` just so the column is not as wide?

`BWUtil` is global memory utilization as a percentage of peak global memory bandwidth; it is really low for this kernel since our input sizes are still very small. I'm keeping `expected_masks_per_segment` since this axis parameter is the mean of the Gaussian distribution from which segment sizes are drawn.
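For illustration, drawing segment sizes from such a distribution might look like the following sketch (the stddev, clamping, and function name are my assumptions, not necessarily what the benchmark does):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draw one size per segment from a normal distribution whose mean is
// expected_masks_per_segment, clamped so every segment has >= 1 mask.
std::vector<int64_t> make_segment_sizes(int64_t num_segments,
                                        double expected_masks_per_segment,
                                        unsigned seed = 42) {
  std::mt19937 gen(seed);
  // Illustrative spread: stddev at a quarter of the mean.
  std::normal_distribution<double> dist(expected_masks_per_segment,
                                        expected_masks_per_segment / 4.0);
  std::vector<int64_t> sizes(num_segments);
  for (auto& s : sizes)
    s = std::max<int64_t>(1, static_cast<int64_t>(std::lround(dist(gen))));
  return sizes;
}
```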

Edit: Not intending to increase work for you (😄), but if you compare the performance against branch-25.08, it would be nice to combine the results into one table, something like this

| num_segments | expected_masks_per_segment | mask_size_bits | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
| 100 | 4 | 32 | 3.117 ms | 8.47% | 123.009 us | 53.29% | -2994.196 us | -96.05% | FAST |
| 1000 | 4 | 32 | 30.138 ms | 3.80% | 876.038 us | 1.25% | -29261.583 us | -97.09% | FAST |
| 10000 | 4 | 32 | 300.686 ms | 0.77% | 8.420 ms | 4.52% | -292266.573 us | -97.20% | FAST |
| 100 | 8 | 32 | 3.592 ms | 5.48% | 129.307 us | 1.59% | -3462.461 us | -96.40% | FAST |
| 1000 | 8 | 32 | 35.980 ms | 0.17% | 885.215 us | 0.96% | -35094.974 us | -97.54% | FAST |
| 10000 | 8 | 32 | 359.420 ms | 0.50% | 8.492 ms | 2.98% | -350928.364 us | -97.64% | FAST |
| 100 | 16 | 32 | 3.663 ms | 0.41% | 130.078 us | 1.47% | -3532.741 us | -96.45% | FAST |
| 1000 | 16 | 32 | 36.740 ms | 0.14% | 913.124 us | 1.17% | -35826.887 us | -97.51% | FAST |
| 10000 | 16 | 32 | 363.865 ms | 0.33% | 8.605 ms | 2.78% | -355260.631 us | -97.64% | FAST |
| 100 | 4 | 64 | 3.005 ms | 1.01% | 120.781 us | 2.20% | -2884.040 us | -95.98% | FAST |
| 1000 | 4 | 64 | 29.456 ms | 0.29% | 879.138 us | 1.25% | -28577.188 us | -97.02% | FAST |
| 10000 | 4 | 64 | 296.477 ms | 0.08% | 8.345 ms | 1.31% | -288131.614 us | -97.19% | FAST |
| 100 | 8 | 64 | 3.539 ms | 5.05% | 129.508 us | 2.05% | -3409.795 us | -96.34% | FAST |
| 1000 | 8 | 64 | 35.390 ms | 0.33% | 884.424 us | 1.00% | -34505.961 us | -97.50% | FAST |
| 10000 | 8 | 64 | 357.183 ms | 0.08% | 8.521 ms | 0.92% | -348662.507 us | -97.61% | FAST |
| 100 | 16 | 64 | 3.639 ms | 0.44% | 131.232 us | 15.66% | -3508.043 us | -96.39% | FAST |
| 1000 | 16 | 64 | 36.358 ms | 0.50% | 916.605 us | 0.76% | -35441.242 us | -97.48% | FAST |
| 10000 | 16 | 64 | 361.185 ms | 0.23% | 8.657 ms | 2.77% | -352527.788 us | -97.60% | FAST |
| 100 | 4 | 128 | 2.999 ms | 1.06% | 121.859 us | 2.53% | -2876.852 us | -95.94% | FAST |
| 1000 | 4 | 128 | 29.471 ms | 1.54% | 876.995 us | 1.02% | -28593.991 us | -97.02% | FAST |
| 10000 | 4 | 128 | 296.791 ms | 0.09% | 8.421 ms | 1.18% | -288369.912 us | -97.16% | FAST |
| 100 | 8 | 128 | 3.531 ms | 0.63% | 130.801 us | 1.51% | -3399.740 us | -96.30% | FAST |
| 1000 | 8 | 128 | 35.327 ms | 0.21% | 882.967 us | 2.87% | -34443.824 us | -97.50% | FAST |
| 10000 | 8 | 128 | 358.118 ms | 0.22% | 8.450 ms | 1.12% | -349668.020 us | -97.64% | FAST |
| 100 | 16 | 128 | 3.644 ms | 0.38% | 130.875 us | 1.41% | -3512.854 us | -96.41% | FAST |
| 1000 | 16 | 128 | 36.375 ms | 0.14% | 918.422 us | 0.76% | -35456.856 us | -97.48% | FAST |
| 10000 | 16 | 128 | 361.774 ms | 0.26% | 8.634 ms | 0.75% | -353140.489 us | -97.61% | FAST |

shrshi avatar Jun 26 '25 02:06 shrshi

/ok to test 7d64717

shrshi avatar Jun 27 '25 22:06 shrshi

/ok to test 5b3b51e

shrshi avatar Jun 30 '25 23:06 shrshi

/merge

shrshi avatar Jul 01 '25 15:07 shrshi