Multi-column null sanitization for struct columns
Description
Partially addresses https://github.com/rapidsai/cudf/issues/18350
This PR builds on the performance improvements in https://github.com/rapidsai/cudf/pull/19014. Instead of sequentially superimposing the top-level struct column null mask over the descendant columns, this PR introduces a new kernel that parallelizes the operation at two levels: (i) over all struct columns in the input table, and (ii) within each struct column, over all the root-to-leaf paths in the column hierarchy.
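To make the parallelized operation concrete, here is a hedged CPU-side sketch (not the actual CUDA kernel, and `superimpose_path` is a hypothetical name) of the core step along one root-to-leaf path: a descendant element can only be valid if every ancestor on its path is valid, which reduces to a bitwise AND of the null-mask words along that path. The new kernel performs these ANDs for all struct columns, and all root-to-leaf paths of each column, in a single launch.

```cpp
#include <cstdint>
#include <vector>

// CPU sketch of superimposing null masks along one root-to-leaf path.
// path_masks holds the null-mask words of each column on the path,
// ordered root -> leaf; all masks are assumed to be the same length.
std::vector<uint32_t> superimpose_path(std::vector<std::vector<uint32_t>> const& path_masks)
{
  // Start from all-valid and AND in each ancestor's mask word-by-word.
  std::vector<uint32_t> out(path_masks.front().size(), ~uint32_t{0});
  for (auto const& mask : path_masks) {
    for (std::size_t w = 0; w < out.size(); ++w) {
      out[w] &= mask[w];
    }
  }
  return out;
}
```

In the GPU version described above, each (struct column, root-to-leaf path) pair maps to its own parallel unit of work, so no column or path is processed sequentially.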
Checklist
- [X] I am familiar with the Contributing Guidelines.
- [X] New or existing tests cover these changes.
- [X] The documentation is up to date with these changes.
Profiles
Input table: Table with ~306k rows and 64 struct columns
./PARQUET_READER_NVBENCH --benchmark parquet_read_decode -a cardinality=0 -a run_length=1 --axis data_type=STRUCT (https://github.com/rapidsai/cudf/pull/19014)
~24 ms spent in superimpose + sanitize nulls. Note that the operation is performed for each column separately. Moreover, the superimpose step is sequential: as we traverse the descendants of the struct column, the null masks are superimposed iteratively for each descendant node.
./PARQUET_READER_NVBENCH --benchmark parquet_read_decode -a cardinality=0 -a run_length=1 --axis data_type=STRUCT (current PR)
~8 ms spent in superimpose + sanitize with the multi-column optimization! The 3x speedup comes primarily from kernel fusion: the single multi-column, multi-path (for within-column superimpose) kernel performs all the necessary superimpose operations in parallel in one launch.
/ok to test 19c37f2
Apparently your commits are signed. I wonder if you already are a `cpp-codeowner`?

There's one that's not verified :(
Benchmark Results
~33% increase in throughput
parquet_read_decode (current PR)
[0] NVIDIA A100 80GB PCIe
| data_type | io_type | cardinality | run_length | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STRUCT | DEVICE_BUFFER | 0 | 1 | 536870912 | 384x | 38.566 ms | 1.55% | 38.557 ms | 1.55% | 13924220177 | 1.392 GiB | 499.298 MiB |
parquet_read_decode (pre-https://github.com/rapidsai/cudf/pull/19014)
[0] NVIDIA A100 80GB PCIe
| data_type | io_type | cardinality | run_length | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STRUCT | DEVICE_BUFFER | 0 | 1 | 536870912 | 9x | 56.910 ms | 0.19% | 56.901 ms | 0.19% | 9435111923 | 1.380 GiB | 499.298 MiB |
/ok to test 19c37f2
> Apparently your commits are signed. I wonder if you already are a `cpp-codeowner`? There's one that's not verified :(
Yes, my gitconfig is not enforcing signing on the merge commits somehow. I need to fix this :(
That happens when you update your branch outside of Cursor/VS Code and then forget to pull before committing anything to your branch in VS Code/Cursor. That usually leads to an unsigned commit plus an automatic merge commit when you push.
Would love to see a benchmark for `segmented_offset_bitmask_binop` sweeping over the `bitmasks_per_segment` and `destination_size`/`source_size_bits` axes. It might be worth using a block instead of a warp per segment at some point.
Thank you for the suggestion - added a benchmark for this.
Benchmark Results
segmented_bitmask_and
[0] NVIDIA A100 80GB PCIe
| num_segments | expected_masks_per_segment | mask_size_bits | input size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Mbytes_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 4 | 32 | 3192 | 7456x | 74.779 us | 96.05% | 67.095 us | 107.04% | 47.574M | 47.574 MB/s | 0.00% | 45 |
| 1000 | 4 | 32 | 31712 | 2704x | 304.538 us | 32.52% | 297.200 us | 33.33% | 106.703M | 106.703 MB/s | 0.01% | 101 |
| 10000 | 4 | 32 | 319736 | 1708x | 2.555 ms | 4.64% | 2.547 ms | 4.66% | 125.511M | 125.511 MB/s | 0.01% | 119 |
| 100 | 8 | 32 | 6392 | 7136x | 77.484 us | 6.25% | 70.161 us | 6.92% | 91.104M | 91.104 MB/s | 0.00% | 86 |
| 1000 | 8 | 32 | 63712 | 1648x | 313.509 us | 1.89% | 306.148 us | 1.94% | 208.109M | 208.109 MB/s | 0.01% | 198 |
| 10000 | 8 | 32 | 639736 | 1680x | 2.640 ms | 4.04% | 2.633 ms | 4.05% | 243.003M | 243.003 MB/s | 0.01% | 231 |
| 100 | 16 | 32 | 12792 | 6992x | 79.079 us | 5.52% | 71.540 us | 5.99% | 178.809M | 178.809 MB/s | 0.01% | 170 |
| 1000 | 16 | 32 | 127712 | 2368x | 340.510 us | 6.43% | 332.978 us | 6.56% | 383.545M | 383.545 MB/s | 0.02% | 365 |
| 10000 | 16 | 32 | 1279736 | 288x | 2.763 ms | 1.24% | 2.756 ms | 1.25% | 464.366M | 464.366 MB/s | 0.02% | 442 |
| 100 | 4 | 64 | 4588 | 8144x | 68.759 us | 2.79% | 61.446 us | 3.00% | 74.667M | 74.667 MB/s | 0.00% | 71 |
| 1000 | 4 | 64 | 45568 | 2448x | 303.213 us | 4.18% | 295.836 us | 4.27% | 154.031M | 154.031 MB/s | 0.01% | 146 |
| 10000 | 4 | 64 | 459604 | 1693x | 2.568 ms | 3.54% | 2.561 ms | 3.55% | 179.481M | 179.481 MB/s | 0.01% | 171 |
| 100 | 8 | 64 | 9388 | 7104x | 78.043 us | 7.20% | 70.494 us | 7.94% | 133.174M | 133.174 MB/s | 0.01% | 127 |
| 1000 | 8 | 64 | 93568 | 1648x | 313.579 us | 1.69% | 306.027 us | 1.73% | 305.751M | 305.751 MB/s | 0.02% | 291 |
| 10000 | 8 | 64 | 939604 | 1024x | 2.600 ms | 1.34% | 2.592 ms | 1.34% | 362.498M | 362.498 MB/s | 0.02% | 345 |
| 100 | 16 | 64 | 18988 | 7120x | 77.703 us | 5.05% | 70.293 us | 5.51% | 270.126M | 270.126 MB/s | 0.01% | 257 |
| 1000 | 16 | 64 | 189568 | 3184x | 338.151 us | 10.95% | 330.565 us | 11.19% | 573.467M | 573.467 MB/s | 0.03% | 546 |
| 10000 | 16 | 64 | 1899604 | 512x | 2.766 ms | 1.27% | 2.758 ms | 1.27% | 688.756M | 688.756 MB/s | 0.04% | 656 |
| 100 | 4 | 128 | 7380 | 8128x | 69.045 us | 2.52% | 61.533 us | 2.77% | 119.935M | 119.935 MB/s | 0.01% | 114 |
| 1000 | 4 | 128 | 73280 | 3168x | 302.243 us | 7.10% | 294.720 us | 7.27% | 248.643M | 248.643 MB/s | 0.01% | 237 |
| 10000 | 4 | 128 | 739340 | 1056x | 2.549 ms | 1.18% | 2.541 ms | 1.19% | 290.920M | 290.920 MB/s | 0.02% | 277 |
| 100 | 8 | 128 | 15380 | 7072x | 78.109 us | 6.49% | 70.744 us | 7.07% | 217.403M | 217.403 MB/s | 0.01% | 207 |
| 1000 | 8 | 128 | 153280 | 2240x | 314.160 us | 2.59% | 306.666 us | 2.44% | 499.828M | 499.828 MB/s | 0.03% | 476 |
| 10000 | 8 | 128 | 1539340 | 1200x | 2.632 ms | 5.31% | 2.624 ms | 5.32% | 586.573M | 586.573 MB/s | 0.03% | 559 |
| 100 | 16 | 128 | 31380 | 7008x | 78.849 us | 1.79% | 71.510 us | 1.94% | 438.819M | 438.819 MB/s | 0.02% | 418 |
| 1000 | 16 | 128 | 313280 | 1520x | 338.739 us | 1.35% | 331.401 us | 1.38% | 945.321M | 945.321 MB/s | 0.05% | 901 |
| 10000 | 16 | 128 | 3139340 | 784x | 2.765 ms | 1.31% | 2.758 ms | 1.32% | 1.138G | 1.138 GB/s | 0.06% | 1085 |
/ok to test 49f26de
Thank you for adding this. Can we also run this on `branch-25.08` and compare performance? I'd recommend removing redundant columns like `Samples` and `Noise`.
Also maybe a dumb question, what is the `BWUtil` column and why is it always 0%?
Maybe another dumb question, is `expected_masks_per_segment` the average number of masks per segment? If so, can we just name it `avg_masks_per_segment` so the column is not that wide.
Edit: Not intending to increase work for you (😄) but if you compare the performance against `branch-25.08`, it would be nice if you could combine the results into one table, something like this:
> Thank you for adding this. Can we also run this on `branch-25.08` and compare performance. Recommended to remove redundant columns like `Samples`, `Noise`.
We cannot directly compare with main since a multi-segment approach did not exist for bitwise bitmask operations. We can, however, put together a baseline implementation using the `bitmask_binop` in 25.08 and process segments sequentially in a loop. I have introduced another benchmark, `multi_segment_bitwise_and`, which implements this approach.
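For illustration, here is a hedged host-side sketch of that baseline strategy (the name `segmented_and_baseline` and the flat one-vector-per-segment layout are hypothetical, not cudf's API): each segment is reduced with an independent AND pass, in a sequential loop over segments, whereas the segmented kernel in this PR handles all segments in a single launch.

```cpp
#include <cstdint>
#include <vector>

// Baseline sketch: one bitmask_binop-style AND reduction per segment,
// looping over segments sequentially. `segments` holds one vector of
// 32-bit mask words per segment; the result holds one reduced word per
// segment (real bitmasks span many words, this keeps the sketch short).
std::vector<uint32_t> segmented_and_baseline(
  std::vector<std::vector<uint32_t>> const& segments)
{
  std::vector<uint32_t> out;
  out.reserve(segments.size());
  for (auto const& seg : segments) {  // sequential loop over segments
    uint32_t acc = ~uint32_t{0};      // identity for bitwise AND
    for (auto word : seg) { acc &= word; }
    out.push_back(acc);
  }
  return out;
}
```

The sequential outer loop is exactly what the multi-segment kernel fuses away, which is why the gap widens as `num_segments` grows in the comparison table below.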
> Also maybe a dumb question, what is the `BWUtil` column and why is it always 0%? Maybe another dumb question, is `expected_masks_per_segment` the average number of masks per segment? If so, can we just name it `avg_masks_per_segment` just so the column is not that wide.
`BWUtil` is the achieved global memory throughput as a percentage of peak global memory bandwidth - it is very low for this kernel since our input sizes are still very small. I'm using `expected_masks_per_segment` since I'm treating this axis parameter as the mean of the Gaussian distribution of segment sizes.
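As a quick sanity check on why the column reads 0%: the ratio is just achieved bytes-per-second over peak bytes-per-second. A minimal sketch (the ~1935 GB/s A100 80GB PCIe peak is an illustrative spec figure, and `bw_util` is a made-up helper name):

```cpp
// Fraction of peak global-memory bandwidth actually achieved.
// bytes_moved / seconds gives achieved throughput; dividing by the
// device peak yields the utilization fraction NVBench reports as BWUtil.
double bw_util(double bytes_moved, double seconds, double peak_bytes_per_sec)
{
  return (bytes_moved / seconds) / peak_bytes_per_sec;
}
```

Plugging in a throughput on the order of tens of MB/s against a peak on the order of a couple of TB/s yields a fraction well below 0.01%, which rounds to the 0.00% shown in the tables.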
> Edit: Not intending to increase work for you (😄) but if you compare the performance against `branch-25.08`, it would be nice if you could just combine results into one table something like this
| num_segments | expected_masks_per_segment | mask_size_bits | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 4 | 32 | 3.117 ms | 8.47% | 123.009 us | 53.29% | -2994.196 us | -96.05% | FAST |
| 1000 | 4 | 32 | 30.138 ms | 3.80% | 876.038 us | 1.25% | -29261.583 us | -97.09% | FAST |
| 10000 | 4 | 32 | 300.686 ms | 0.77% | 8.420 ms | 4.52% | -292266.573 us | -97.20% | FAST |
| 100 | 8 | 32 | 3.592 ms | 5.48% | 129.307 us | 1.59% | -3462.461 us | -96.40% | FAST |
| 1000 | 8 | 32 | 35.980 ms | 0.17% | 885.215 us | 0.96% | -35094.974 us | -97.54% | FAST |
| 10000 | 8 | 32 | 359.420 ms | 0.50% | 8.492 ms | 2.98% | -350928.364 us | -97.64% | FAST |
| 100 | 16 | 32 | 3.663 ms | 0.41% | 130.078 us | 1.47% | -3532.741 us | -96.45% | FAST |
| 1000 | 16 | 32 | 36.740 ms | 0.14% | 913.124 us | 1.17% | -35826.887 us | -97.51% | FAST |
| 10000 | 16 | 32 | 363.865 ms | 0.33% | 8.605 ms | 2.78% | -355260.631 us | -97.64% | FAST |
| 100 | 4 | 64 | 3.005 ms | 1.01% | 120.781 us | 2.20% | -2884.040 us | -95.98% | FAST |
| 1000 | 4 | 64 | 29.456 ms | 0.29% | 879.138 us | 1.25% | -28577.188 us | -97.02% | FAST |
| 10000 | 4 | 64 | 296.477 ms | 0.08% | 8.345 ms | 1.31% | -288131.614 us | -97.19% | FAST |
| 100 | 8 | 64 | 3.539 ms | 5.05% | 129.508 us | 2.05% | -3409.795 us | -96.34% | FAST |
| 1000 | 8 | 64 | 35.390 ms | 0.33% | 884.424 us | 1.00% | -34505.961 us | -97.50% | FAST |
| 10000 | 8 | 64 | 357.183 ms | 0.08% | 8.521 ms | 0.92% | -348662.507 us | -97.61% | FAST |
| 100 | 16 | 64 | 3.639 ms | 0.44% | 131.232 us | 15.66% | -3508.043 us | -96.39% | FAST |
| 1000 | 16 | 64 | 36.358 ms | 0.50% | 916.605 us | 0.76% | -35441.242 us | -97.48% | FAST |
| 10000 | 16 | 64 | 361.185 ms | 0.23% | 8.657 ms | 2.77% | -352527.788 us | -97.60% | FAST |
| 100 | 4 | 128 | 2.999 ms | 1.06% | 121.859 us | 2.53% | -2876.852 us | -95.94% | FAST |
| 1000 | 4 | 128 | 29.471 ms | 1.54% | 876.995 us | 1.02% | -28593.991 us | -97.02% | FAST |
| 10000 | 4 | 128 | 296.791 ms | 0.09% | 8.421 ms | 1.18% | -288369.912 us | -97.16% | FAST |
| 100 | 8 | 128 | 3.531 ms | 0.63% | 130.801 us | 1.51% | -3399.740 us | -96.30% | FAST |
| 1000 | 8 | 128 | 35.327 ms | 0.21% | 882.967 us | 2.87% | -34443.824 us | -97.50% | FAST |
| 10000 | 8 | 128 | 358.118 ms | 0.22% | 8.450 ms | 1.12% | -349668.020 us | -97.64% | FAST |
| 100 | 16 | 128 | 3.644 ms | 0.38% | 130.875 us | 1.41% | -3512.854 us | -96.41% | FAST |
| 1000 | 16 | 128 | 36.375 ms | 0.14% | 918.422 us | 0.76% | -35456.856 us | -97.48% | FAST |
| 10000 | 16 | 128 | 361.774 ms | 0.26% | 8.634 ms | 0.75% | -353140.489 us | -97.61% | FAST |
/ok to test 7d64717
/ok to test 5b3b51e
/merge