Add multithreaded parquet reader benchmarks.
Addresses: https://github.com/rapidsai/cudf/issues/12700
Adds multithreaded benchmarks for the parquet reader, with separate benchmarks for the chunked and non-chunked readers. In both cases, the primary cases are 2, 4 and 8 threads running reads at the same time. There is not much variability in the other benchmarking axes.
The primary use of this particular benchmark is to see inter-kernel performance (that is, how well our many different kernel types coexist with each other), whereas normal benchmarks tend to be more for intra-kernel performance checking.
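For a sense of the pattern being measured, here is a minimal sketch of a multithreaded read, not the benchmark's actual code: it assumes one `std::thread` per reader and an `rmm::cuda_stream_pool` for per-thread streams, and the file list and thread count are placeholders.

```cpp
#include <cudf/io/parquet.hpp>

#include <rmm/cuda_stream_pool.hpp>

#include <string>
#include <thread>
#include <vector>

// Launch `num_threads` host threads, each issuing a full parquet read on its
// own CUDA stream, then wait for all of them to finish.
void read_parquet_multithreaded(std::vector<std::string> const& paths, int num_threads)
{
  rmm::cuda_stream_pool stream_pool(num_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&, t] {
      auto const options =
        cudf::io::parquet_reader_options::builder(cudf::io::source_info{paths[t % paths.size()]})
          .build();
      // Each thread reads on its own stream so decode kernels and copies
      // from different threads can overlap on the device.
      auto result = cudf::io::read_parquet(options, stream_pool.get_stream());
    });
  }
  for (auto& th : threads) { th.join(); }
}
```

Running each read on its own stream is what allows kernels from one thread to interleave with copies from another, which is exactly the inter-kernel behavior this benchmark is meant to expose.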
NVTX ranges are included to help visually group the bundles of reads together in Nsight Systems. I also filed a new issue that would help along these lines: https://github.com/rapidsai/cudf/issues/15575
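For reference, wrapping a bundle of reads in a named range with the NVTX 3 C++ API looks roughly like this (the range label here is made up for illustration):

```cpp
#include <nvtx3/nvtx3.hpp>

void launch_read_bundle()
{
  // Everything issued inside this scope shows up as one named range
  // on the timeline in Nsight Systems.
  nvtx3::scoped_range range{"parquet_multithreaded_read"};
  // ... spawn the reader threads and join them here ...
}
```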
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
One thing of note: I'm tempted to go larger with some of the sizes here because I want to make sure we're saturating the GPU as much as we can, but using sizes much larger than this (especially in the chunked case) causes total memory usage to quickly blow past 16 GB. I don't know how much we care about that.
What kind of runtime does this benchmark have? How does performance look? Are we able to catch regressions like https://github.com/rapidsai/cudf/pull/14167?
@nvdbaranec BTW @vuule and I were discussing that maybe we need to introduce a pinned host buffer data source to show better scaling with this kind of multi-threaded benchmark.
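A pinned host buffer source could look roughly like the following. This is only a sketch of the concept, assuming a libcudf version whose `cudf::io::source_info` accepts a `host_span` over raw bytes; `cudaMallocHost` stands in for whatever pinned allocator a real datasource would use, and error checking is omitted.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/utilities/span.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <cstring>
#include <vector>

// Read a parquet file whose encoded bytes live in pinned (page-locked) host
// memory, so the reader's host-to-device copies can run asynchronously and
// overlap better across threads. `file_bytes` stands in for the file contents.
cudf::io::table_with_metadata read_from_pinned(std::vector<std::byte> const& file_bytes)
{
  std::byte* pinned = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&pinned), file_bytes.size());
  std::memcpy(pinned, file_bytes.data(), file_bytes.size());

  auto const source =
    cudf::io::source_info{cudf::host_span<std::byte const>{pinned, file_bytes.size()}};
  auto const options = cudf::io::parquet_reader_options::builder(source).build();
  auto result        = cudf::io::read_parquet(options);

  // Safe to free once the read has completed.
  cudaFreeHost(pinned);
  return result;
}
```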
/ok to test
/merge
Thank you @nvdbaranec, these benchmarks are excellent!
Here are the settings that I recently used to study the interleaving of copy and compute on A100 (+ @vuule):
```shell
./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=16000000000 --axis num_threads=16
```
I decided to scale `total_data_size` with `num_threads` so that I could compare the throughput of 1 thread reading 1 GB against that of 10 threads reading 10 GB.
Beautiful results!