Replace thrust reductions in Parquet reader with CUB + pinned memory based implementations
Description
Contributes to #20722
This PR replaces the thrust::reduce and thrust::transform_reduce calls that incur a pageable device-to-host (D2H) copy with custom implementations based on CUB and pinned host memory.
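To illustrate the pattern (this is a hedged sketch, not the PR's actual code; the function name `sum_on_device` and the plain `int` reduction are hypothetical): a Thrust reduction returns its result into pageable host memory, forcing a blocking copy, whereas CUB can write the result to a device buffer that is then copied into pinned host memory with a genuinely asynchronous D2H transfer on the stream.

```cpp
// Hypothetical sketch: replace a thrust::reduce (which returns the
// result via a pageable D2H copy) with cub::DeviceReduce::Sum plus a
// pinned-memory staging buffer for the async D2H copy.
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

int sum_on_device(int const* d_in, int num_items, cudaStream_t stream)
{
  int* d_out = nullptr;
  cudaMallocAsync(&d_out, sizeof(int), stream);

  // CUB two-phase call: first query temporary storage size, then reduce.
  void* d_temp      = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items, stream);
  cudaMallocAsync(&d_temp, temp_bytes, stream);
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items, stream);

  // Copy the scalar result into pinned host memory; because the host
  // buffer is pinned, this D2H copy is truly asynchronous on the stream.
  int* h_result = nullptr;
  cudaMallocHost(&h_result, sizeof(int));
  cudaMemcpyAsync(h_result, d_out, sizeof(int),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // single sync point once the value is needed

  int result = *h_result;
  cudaFreeHost(h_result);
  cudaFreeAsync(d_temp, stream);
  cudaFreeAsync(d_out, stream);
  return result;
}
```

In libcudf the pinned buffer and temporary storage would typically come from the library's pinned and device memory resources rather than raw `cudaMallocHost`/`cudaMallocAsync`; the sketch uses raw CUDA calls only to keep the example self-contained.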
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
pre-commit.ci autofix
Another request: please file a CCCL feature request for Thrust to support returning host values via pinned memory. It would help them to know that returning pageable memory and then synchronizing is a pain point for us, even if we don't adopt the feature in the near/medium term.
Ran the Parquet reader benchmarks (single-threaded, multithreaded, and filter variants) and saw no change in performance with this PR.
/ok to test 50dd93e
/merge