
[FEA] Improve Parquet decoding throughput in libcudf

Open GregoryKimball opened this issue 11 months ago • 2 comments

Is your feature request related to a problem? Please describe.

We use specialized kernels to process uncompressed, encoded parquet data pages into arrow-formatted cuDF columns. The throughput achieved by the decoding kernels is a function of data type, and many data types show <50 GB/s of throughput on H100.

  • generic decode kernel (gpuDecodePageDataGeneric) that dispatches to decode device functions for struct, list, string, fixed-width, and dictionary types
  • delta byte array kernel for strings
  • delta length byte array kernel for strings

Image

You can generate this data using this command:

 ./PARQUET_READER_NVBENCH -b 0 -a cardinality=0 -a run_length=1 --timeout 0.2

Nsight Systems Profiles

Let's look at nsys profiles for each data type.

INTEGER

Image: shows three gpuDecodePageDataGeneric kernels (4 ms total). Peak warp occupancy is ~50%, with drops at the end.

FLOAT

Image: shows one 1.7 ms gpuDecodePageDataGeneric kernel. Peak warp occupancy is ~50%, with drops at the end.

BOOL8

Image: shows one 8 ms gpuDecodePageDataGeneric kernel. Peak warp occupancy is ~50%, with drops at the end.

DECIMAL

Image: shows 8.3 ms of unsnap (we should remove this, see #18349) and 1.3 ms of gpuDecodePageDataGeneric.

TIMESTAMP

Image: shows 8.3 ms of unsnap (we should remove this, see #18349) and 1.7 ms over two calls of gpuDecodePageDataGeneric.

DURATION

Image: shows 1.6 ms of unsnap (we should remove this, see #18349) and 1.7 ms over two calls of gpuDecodePageDataGeneric.

STRING

Image: shows 10 ms of unsnap (we should remove this, see #18349), 0.7 ms in gpuComputeStringPageBounds, and 9.6 ms in gpuDecodePageDataGeneric.

LIST

Image: shows 3.1 ms of unsnap, 6.2 ms of gpuComputePageSizes, and 8.4 ms of gpuDecodePageDataGeneric, plus 9.5 ms of cub::reduce across roughly 120 calls. Is this a per-page reduction? (Needs attribution)

STRUCT

Image: shows 9.3 ms in unsnap, 6 ms in gpuDecodePageDataGeneric over two calls, and 29 ms in recursive calls to superimpose_nulls_no_sanitize.

Describe the solution you'd like

I believe we should target at least 100 GB/s of throughput in the decode stage. We may need new approaches, such as state machines based on the libcudf FST, intra-warp coordination, and better use of shared memory and the L2 cache, among other ideas.

Opportunities:

  • For strings, can we precompute more pieces similar to gpuComputeStringPageBounds so that the decode can run faster than 9.6 ms?
  • For lists, can we fuse any of the cub::DeviceReduce::Reduce calls?
  • For structs, can we find a more efficient approach to ensure nulls sanitization? 29 ms is a very long time

GregoryKimball avatar Mar 21 '25 20:03 GregoryKimball

list and struct changes are related to https://github.com/rapidsai/cudf/issues/17356

After discussion today with @shrshi, we suspect that the poor sanitization performance could actually be a bug. Perhaps we should keep sanitizing, but design a method that can run at >100 GB/s.

GregoryKimball avatar May 08 '25 16:05 GregoryKimball

Hi @GregoryKimball ,

Quick question. I found this performance benchmark intriguing and would appreciate your thoughts when you have a moment.

I’m starting with simple data types like integers and floats. Integers seem quite common in the output of CPU Parquet writers, so I’m curious why the integer throughput is only half that of floats.

  • Given that GPUs typically offer massive integer throughput (GOPS), I’d have expected more comparable performance between integers and floats.

JigaoLuo avatar Jun 02 '25 10:06 JigaoLuo

There's an ongoing effort to remove purging of non-empty nulls from the column factories: https://github.com/rapidsai/cudf/pull/12873. This should further improve throughput for list and struct columns.

vuule avatar Aug 26 '25 18:08 vuule

I’m still thinking this through; for now I’m staying at a high level while exploring ways to improve warp occupancy. I’m using my own reader implementation from a previous issue, which is built on pipelines.

One key observation is that the number of pages per column chunk determines the decode kernel's grid size (one thread block per page), which in turn affects the decoding kernel's running time. Blocks accumulate across columns; e.g., reading 4 columns with 200 pages each launches 4×200 = 800 blocks. Code ref: https://github.com/rapidsai/cudf/blob/7362ce261962c62a406402d7c7b965d3bcbaf7a5/cpp/src/io/parquet/page_delta_decode.cu#L921

In my early experiments, I kept all variables constant (e.g., row group size) and only varied the number of pages. Here’s what I observed:

  • Low page count (~10): reading is slow because the number of blocks launched is below the number of available GPU SMs; this makes sense and is expected.
  • Medium page count (50–100): I/O starts to saturate, but warp occupancy remains low.
  • High page count (>200): the number of blocks far exceeds the SM count. End-to-end read time doesn’t improve because the workload is I/O-bound, but warp occupancy increases slightly. Each decoding kernel runs for a shorter time, allowing other pipelined kernels to be scheduled earlier; this side effect improves overall pipeline smoothness.

This study focuses purely on page count, without modifying any decoding code. While the storage throughput doesn’t improve (since the workload is I/O-bound), the higher page count indirectly benefits the scheduling of other kernels, an unexpected but welcome effect. I’m not sure how to plot my observations here, since storage performance remains flat across page counts, but I’d be happy to share my nsys profiles.

JigaoLuo avatar Sep 17 '25 11:09 JigaoLuo

Thank you @JigaoLuo for mentioning the page size effect.

I've also observed benefits from smaller page sizes, with roughly 128 KiB performing best. We saw a big benefit in decompression with nvCOMP 4.2, but nvCOMP 5.0 is less sensitive to page size and achieves improved throughput even with larger pages (256–512 KiB). I have also seen decoding benefits from smaller pages, but these were not large enough to show up in my top-line query execution numbers.

As of Sep 2025, we write with a page size default target of 512 KiB and pyarrow writes with a default target of 1 MB. We might drop our default at some point if the performance benefits are worth deviating further from the conventions of other writers.

GregoryKimball avatar Sep 17 '25 15:09 GregoryKimball

@JigaoLuo There's a PR up that updates decode block sizes for better throughput that might interest you. https://github.com/rapidsai/cudf/pull/19967

Unfortunately, the warp occupancy in Parquet decode kernels is limited by the amount of exploitable parallelism inherent to the encodings, so your observations make sense to me.

mhaseeb123 avatar Sep 17 '25 18:09 mhaseeb123

Hi @GregoryKimball, thanks; I think this is genuinely valuable. I’m curious whether any of this is documented elsewhere as a guideline for generating cuDF-friendly Parquet files. Personally, I only discovered these insights through extensive profiling with nsys and reading the cuDF code, as mentioned above.

One comment on such a potential guideline: I’ve been rewriting Parquet files using arrow-rs, not cuDF. Realistically, many Parquet files are created without cuDF; last time I checked, HuggingFace datasets were written using PyArrow.

This makes the need for a guideline even more pressing. Some well-known Parquet writers (I won’t name them here) generate row groups with only a single page, which is highly suboptimal for cuDF. In such cases, users might mistakenly blame the cuDF reader for poor performance, overlooking that the real culprit is the Parquet file layout itself.

I strongly believe a guideline like this would be extremely helpful, and I’d be happy to contribute if one doesn’t already exist.


Comment on the 128 KiB page size: thanks again for the insight. In my custom reader (built from my story issue), I read one row group per read_parquet call, and each decoding kernel typically runs in ~2 ms. I’m still experimenting with the optimal page size for decoding, but my current guess is that smaller pages, likely under 128 KiB, might work best for my setup. I’ll share more details about the page layout and my own reader in the future.

JigaoLuo avatar Sep 18 '25 20:09 JigaoLuo

Hi @GregoryKimball and @mhaseeb123, thanks again, as always, for the support I’ve received. I see the optimization objective as two-fold. One side focuses on decoding bandwidth, which is what we all care about in this issue. The other side I care about, though it is less concrete and I am still exploring it, is reducing launch latency for non-cuDF kernels. As mentioned above:

High page count (>200): the number of blocks far exceeds the SM count. End-to-end read time doesn’t improve because the workload is I/O-bound, but warp occupancy increases slightly. Each decoding kernel runs for a shorter time, allowing other pipelined kernels to be scheduled earlier; this side effect improves overall pipeline smoothness.

That’s what led me to investigate further. In a multi-stream setup, I noticed that poorly configured page layouts can cause non-cuDF kernels to experience long delays before being launched. These kernels end up waiting unnecessarily, which hurts overall pipeline efficiency.

JigaoLuo avatar Sep 18 '25 20:09 JigaoLuo