[DO NOT MERGE] Remove unnecessary synchronization (miss-sync) during Parquet reading
Description
For issue https://github.com/rapidsai/cudf/issues/18967, this draft PR aims to remove all unnecessary synchronization points (termed "miss-sync") in the Parquet reader. Please hold off on merging this draft; the plan is to split it into smaller PRs for the actual merge.
TL;DR: This is the scalability gain :rocket: once the smaller PRs split from this draft are merged:
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
Outcome
With the current code in this draft, Parquet reading performs no memcpy on pageable memory. Here’s the outcome of this draft, comparing the number of "miss-sync" occurrences against the latest branch-25.08:
Comparison: expert systems view in Nsight Systems (nsys analyze):
In this PR draft:
```
$ nsys analyze -r cuda_memcpy_async "/tmp/no-misssync.sqlite"
Processing [/tmp/no-misssync.sqlite] with [/opt/nsight-systems-20250101/host-linux-x64/rules/cuda_memcpy_async.py]...
** CUDA Async Memcpy with Pageable Memory (cuda_memcpy_async):
There were no problems detected related to memcpy operations using pageable
memory.
```
In the latest branch-25.08:
```
$ nsys analyze -r cuda_memcpy_async "/tmp/branch2508.sqlite"
Processing [/tmp/branch2508.sqlite] with [/opt/nsight-systems-20250101/host-linux-x64/rules/cuda_memcpy_async.py]...
** CUDA Async Memcpy with Pageable Memory (cuda_memcpy_async):
The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy
operations to block and be executed synchronously. This leads to low GPU
utilization.
Suggestion: If applicable, use PINNED memory instead.
Duration (ns)  Start (ns)     Src Kind  Dst Kind  Bytes (MB)  PID        Device ID  Context ID  Green Context ID  Stream ID  API Name
-------------  -------------  --------  --------  ----------  ---------  ---------  ----------  ----------------  ---------  --------------------------
6,432          9,395,999,168  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
3,456          9,080,770,580  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
3,232          9,679,481,106  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
3,168          9,081,092,563  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,592          9,084,669,960  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,432          9,082,044,304  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,432          9,084,155,818  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,432          9,084,374,089  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,400          9,084,414,249  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,400          9,084,450,409  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,400          9,084,542,632  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,400          9,084,602,664  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,400          9,084,908,807  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,400          9,956,004,698  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,368          9,081,670,354  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,368          9,084,627,336  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,368          9,085,019,847  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,368          9,955,974,299  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,272          9,365,887,457  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,272          9,366,435,775  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,240          9,679,508,786  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,208          9,956,095,418  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,112          9,365,906,657  Device    Pageable  0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
2,080          9,366,408,511  Device    Pageable  0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
2,080          9,395,470,882  Pageable  Device    0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
1,888          9,391,287,568  Pageable  Device    0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
1,088          9,395,902,593  Pageable  Device    0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
1,088          9,400,020,212  Pageable  Device    0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
1,056          10,200,105,651 Pageable  Device    0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
1,024          10,200,112,371 Pageable  Device    0.000       1,521,787  0          1                             13         cudaMemcpyAsync_ptsz_v7000
1,024          10,366,822,916 Pageable  Device    0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
1,024          10,366,834,276 Pageable  Device    0.000       1,521,787  0          1                             14         cudaMemcpyAsync_ptsz_v7000
```
The workload only reads the TPC-H SF10 lineitem table with 2 threads: nsys profile ./parquet_io_multithreaded lineitem-sf10.parquet 1 FILEPATH 1 2
Goal & fix: no miss-sync
The core goal of this draft is to eliminate pageable memcpy and miss-sync in the Parquet reading path, enabling seamless multi-streaming and pipelining without synchronization bottlenecks. While my initial private implementation focused on pure in-place fixes for performance only, this PR takes a refactored approach: it modifies underlying header files (e.g., cpp/include/cudf/detail/utilities/vector_factories.hpp and cpp/include/cudf/detail/utilities/cuda_memcpy.hpp) to address the issues at their root, rather than applying isolated fixes within the Parquet codebase. I acknowledge that this refactored style may complicate code review and require broader test coverage, which I’m prepared to address based on feedback. Please advise whether you prefer the current refactored approach or a more localized in-place fix style.
The changes also raise considerations for unit testing. Since the in-place fix style alters only internal Parquet functions, it might require minimal new testing, whereas the refactored approach, by modifying shared headers such as vector_factories.hpp, would require additional test cases. I will defer detailed unit tests until the first code-style discussion is finalized.
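For reviewers less familiar with the root cause: the "miss-sync" is CUDA's documented fallback, where cudaMemcpyAsync involving a pageable host buffer executes synchronously with respect to the host thread. A minimal standalone illustration using only the CUDA runtime API (a sketch, not cudf code and not part of this patch):

```cpp
#include <cuda_runtime.h>

int main()
{
  float* d_buf;
  cudaMalloc(&d_buf, sizeof(float));

  // Pageable destination: despite the "Async" name, the driver stages the
  // copy through an internal buffer and the call blocks the host thread.
  // These are exactly the rows flagged by `nsys analyze -r cuda_memcpy_async`.
  float pageable = 0.f;
  cudaMemcpyAsync(&pageable, d_buf, sizeof(float), cudaMemcpyDeviceToHost, 0);

  // Pinned (page-locked) destination: the copy is genuinely asynchronous,
  // and only an explicit stream synchronization makes the value visible.
  float* pinned;
  cudaMallocHost(&pinned, sizeof(float));
  cudaMemcpyAsync(pinned, d_buf, sizeof(float), cudaMemcpyDeviceToHost, 0);
  cudaStreamSynchronize(0);

  cudaFreeHost(pinned);
  cudaFree(d_buf);
  return 0;
}
```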
Takeaway
You can read the long version of takeaway here: https://github.com/rapidsai/cudf/issues/18967
As a general takeaway for fixing miss-sync:
- Avoid `thrust::reduce` and similar Thrust functions; use CUB primitives instead.
- Replace `rmm::device_scalar` with `cudf::detail::device_scalar` for its bounce buffers, which enable efficient data copying.
- Eliminate `rmm::device_vector::element(...)` calls, as they rely on pageable memory (a limitation best resolved within RMM itself).
- (I will update this list as further insights emerge during the review process.)
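To make the first bullet concrete, here is a hedged sketch (illustrative only, not code from this PR) of swapping a host-synchronizing `thrust::reduce` for `cub::DeviceReduce::Sum`, which writes its result to device memory and stays asynchronous on the caller's stream:

```cpp
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>
#include <cstddef>

// Sum `num_items` ints from d_in into *d_out without synchronizing the host.
void sum_async(int const* d_in, int* d_out, int num_items, cudaStream_t stream)
{
  // A thrust::reduce(thrust::cuda::par.on(stream), d_in, d_in + num_items)
  // here would return the sum by value, forcing the host to wait (miss-sync).

  // CUB's two-phase pattern: the first call only queries the scratch size...
  void* d_temp_storage   = nullptr;
  std::size_t temp_bytes = 0;
  cub::DeviceReduce::Sum(d_temp_storage, temp_bytes, d_in, d_out, num_items, stream);

  // ...the second call runs the reduction on `stream`; the result lands in
  // device memory (d_out), so no host round trip is needed.
  cudaMallocAsync(&d_temp_storage, temp_bytes, stream);
  cub::DeviceReduce::Sum(d_temp_storage, temp_bytes, d_in, d_out, num_items, stream);
  cudaFreeAsync(d_temp_storage, stream);
}
```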
Let me know your thoughts on the approach, and I’ll adjust the PR accordingly.
Here are the remaining technical questions from my side (I left them here instead of in the code review):
- When replacing `rmm::device_scalar` with `cudf::detail::device_scalar` in only `cpp/include/cudf/reduction/detail/reduction.cuh`, the code compiles but fails at link time with undefined symbol errors:
```
/home/jluo/cudf-dev/cpp/examples/parquet_io/build/./parquet_io_multithreaded: symbol lookup error: /home/jluo/miniconda3/envs/cudf_dev/lib/libcudf.so: undefined symbol: _ZN4cudf16timestamp_scalarIN4cuda3std3__46chrono10time_pointINS4_12system_clockENS4_8durationIlNS3_5ratioILl1ELl1000EEEEEEEEC1INS_6detail13device_scalarISB_EEEERKT_bN3rmm16cuda_stream_viewENS1_2mr3__418basic_resource_refILNSN_10_AllocTypeE1EJNSN_17device_accessibleEEEE
```
[Update & Self-Note]: This indicates that a partial replacement is insufficient. A comprehensive replacement from rmm::device_scalar to cudf::detail::device_scalar is necessary. For example, changing just one line in cpp/include/cudf/reduction/detail/reduction.cuh led to linker errors involving timestamp_scalar and chrono_scalar.
[Update Again]: We decided to push this improvement down into RMM: https://github.com/rapidsai/rmm/pull/1996 . Thus, we can ignore this ugly symbol lookup error for now.
- [Tracked in #18980] A "nice-looking" modification to `hostdevice_vector::element` causes a segmentation fault in the compiler.
Details folded
The only working approach I’ve found so far is: https://github.com/rapidsai/cudf/pull/18968/files#diff-a97b18b281b557fd56b44267cd57cda53be25f7073a6a86b34c083d88658ac64R117-R126
The non-working "nice-looking" approach I initially considered was:
```cpp
[[nodiscard]] T element(std::size_t element_index, rmm::cuda_stream_view stream) const
{
  // Copy d_data[element_index] to the host via pinned memory
  return make_pinned_vector<T>(cudf::device_span<T>{d_data.element_ptr(element_index), 1}, stream)
    .front();
}
```
Hi @JigaoLuo, thank you for this patch! I'm working my way through the changes.
Have you seen any performance impact from these changes?
> A "nice-looking" modification to hostdevice_vector::element causes a segmentation fault during compilation. The only working approach I’ve found so far is: https://github.com/rapidsai/cudf/pull/18968/files#diff-a97b18b281b557fd56b44267cd57cda53be25f7073a6a86b34c083d88658ac64R117-R126
@JigaoLuo Thanks for raising this. I think I have also seen this compiler segfault when there's a cudf::make_host_vector(device_span()) in code. ~~Is it possible for you to open a separate issue with an isolated diff that would cause it?~~
Created issue #18980
Hi @vuule @mhaseeb123 @wence- @GregoryKimball, thanks for the review and discussion! I think there is still a need for discussion; after that, I’ll address the feedback in separate PRs once we finalize the details. I’ll definitely make time for this work.
~~I’ll also try to explore performance optimizations in benchmarks.~~ [Update: you can find the benchmark in the following message]
Hi @vuule @mhaseeb123 @wence- @GregoryKimball, I’ve run preliminary benchmarks, and as expected, the PR draft shows significant :rocket: gains in scenarios with frequent synchronization stalls ("miss-sync"). This is particularly evident when each thread reads many small, concurrent file segments, for example when each thread reads different row groups from a large Parquet file. If this use case convinces you, then we are one step closer to a SpeedOfLight Parquet reader.
I plotted my benchmark in the style of Greg's comment https://github.com/rapidsai/cudf/issues/15620#issuecomment-2159045929 . The PR draft is fully compilable, so feel free to test it when you have time.
You can also find my simple command and details here:
How many miss-syncs are saved?
You will see where this speedup comes from and how nasty the miss-sync can be:
Upstream
```
$ nsys export --output report_upstream_8threads.sqlite --type sqlite report_upstream_8threads.nsys-rep
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
166861
```
This PR
```
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_pr_8threads.sqlite | wc -l
301
```
All of those remaining miss-syncs come from the Parquet writer; with this patch, no pageable memcpy exists in the Parquet reader.
Command
With num_iterations=10, each thread reads 128 MB ten times, to mimic the use case and to create more miss-syncs.
```shell
for t in 1 2 4 8 16 32 64 128; do
  ./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 \
    --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t \
    --axis num_iterations=10 --csv <PATH>;
done
```
I ran this same command on this PR draft as well as on the upstream branch-25.08 to generate the two sets of CSV files.
Hardware setup
RMM memory resource = pool
CUIO host memory resource = pinned_pool
# Devices
## [0] `NVIDIA A100-SXM4-40GB`
* SM Version: 800 (PTX Version: 800)
* Number of SMs: 108
* SM Default Clock Rate: 1410 MHz
* Global Memory: 19704 MiB Free / 40339 MiB Total
* Global Memory Bus Peak: 1555 GB/sec (5120-bit DDR @1215MHz)
* Max Shared Memory: 164 KiB/SM, 48 KiB/Block
* L2 Cache Size: 40960 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: Yes
(I realized some of my code review comments weren’t successfully submitted—sorry for the confusion.)
Note: There may be a bug in this draft or elsewhere in the codebase. I ran millions of read operations over a 10-hour period, purely reading, and encountered a single instance of incorrect results, an estimated trigger rate of just 0.0001%. I cannot be sure whether the bug is in my draft or in the cudf code (I am still using branch-25.08).
I’m leaving this note here as a reminder for myself, and we can consider adding more checks during the merging for the rest of this draft.
[Self-note update]: I’ve observed the same bug even without the patch. The reproduction now seems tied to my metadata caching PR. I’ll allocate time to reproduce it more reliably.