Add multithreaded parquet reader benchmarks.
Addresses: https://github.com/rapidsai/cudf/issues/12700
Adds multithreaded benchmarks for the parquet reader, with separate benchmarks for the chunked and non-chunked readers. In both cases, the primary cases are 2, 4 and 8 threads running reads at the same time. There is not much variability in the other benchmarking axes.
The primary use of this particular benchmark is to see inter-kernel performance (that is, how well our many different kernel types coexist with each other), whereas normal benchmarks tend to be more for intra-kernel performance checking.
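For a sense of the pattern being measured, here is a minimal sketch of a multithreaded read, not the benchmark's actual code: it assumes one `std::thread` per reader and an `rmm::cuda_stream_pool` for per-thread streams, and the file list and thread count are placeholders.

```cpp
#include <cudf/io/parquet.hpp>

#include <rmm/cuda_stream_pool.hpp>

#include <string>
#include <thread>
#include <vector>

// Launch `num_threads` host threads, each issuing a full parquet read on its
// own CUDA stream, then wait for all of them to finish.
void read_parquet_multithreaded(std::vector<std::string> const& paths, int num_threads)
{
  rmm::cuda_stream_pool stream_pool(num_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&, t] {
      auto const options =
        cudf::io::parquet_reader_options::builder(cudf::io::source_info{paths[t % paths.size()]})
          .build();
      // Each thread reads on its own stream so decode kernels and copies
      // from different threads can overlap on the device.
      auto result = cudf::io::read_parquet(options, stream_pool.get_stream());
    });
  }
  for (auto& th : threads) { th.join(); }
}
```

Running each read on its own stream is what allows kernels from one thread to interleave with copies from another, which is exactly the inter-kernel behavior this benchmark is meant to expose.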
NVTX ranges are included to help visually group the bundles of reads together in Nsight Systems. I also filed a new issue that would help along these lines: https://github.com/rapidsai/cudf/issues/15575
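For reference, wrapping a bundle of reads in a named range with the NVTX 3 C++ API looks roughly like this (the range label here is made up for illustration):

```cpp
#include <nvtx3/nvtx3.hpp>

void launch_read_bundle()
{
  // Everything issued inside this scope shows up as one named range
  // on the timeline in Nsight Systems.
  nvtx3::scoped_range range{"parquet_multithreaded_read"};
  // ... spawn the reader threads and join them here ...
}
```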
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
One thing of note: I'm tempted to go larger with some of the sizes here because I want to make sure we're saturating the GPU as much as we can, but using sizes much larger than this (especially in the chunked case) causes total memory usage to quickly blow past 16 GB. I don't know how much we care about that.
What kind of runtime does this benchmark have? How does performance look? Are we able to catch regressions like https://github.com/rapidsai/cudf/pull/14167?
@nvdbaranec BTW @vuule and I were discussing that maybe we need to introduce a pinned host buffer data source to show better scaling with this kind of multi-threaded benchmark.
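A pinned host buffer source could look roughly like the following. This is only a sketch of the concept, assuming a libcudf version whose `cudf::io::source_info` accepts a `host_span` over raw bytes; `cudaMallocHost` stands in for whatever pinned allocator a real datasource would use, and error checking is omitted.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/utilities/span.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <cstring>
#include <vector>

// Read a parquet file whose encoded bytes live in pinned (page-locked) host
// memory, so the reader's host-to-device copies can run asynchronously and
// overlap better across threads. `file_bytes` stands in for the file contents.
cudf::io::table_with_metadata read_from_pinned(std::vector<std::byte> const& file_bytes)
{
  std::byte* pinned = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&pinned), file_bytes.size());
  std::memcpy(pinned, file_bytes.data(), file_bytes.size());

  auto const source =
    cudf::io::source_info{cudf::host_span<std::byte const>{pinned, file_bytes.size()}};
  auto const options = cudf::io::parquet_reader_options::builder(source).build();
  auto result        = cudf::io::read_parquet(options);

  // Safe to free once the read has completed.
  cudaFreeHost(pinned);
  return result;
}
```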
/ok to test
/merge
Thank you @nvdbaranec, these benchmarks are excellent!
Here are the settings that I recently used to study the interleaving of copy and compute on A100 (+ @vuule):
```shell
./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=16000000000 --axis num_threads=16
```
I decided to scale `total_data_size` with `num_threads` so that I could compare the throughput of 1 thread reading 1 GB against that of 10 threads reading 10 GB.
Beautiful results!