
[DO NOT MERGE] Remove unnecessary synchronization (miss-sync) during Parquet reading

Open JigaoLuo opened this issue 9 months ago • 7 comments

Description

For the issue https://github.com/rapidsai/cudf/issues/18967, this is a PR Draft aimed at removing all unnecessary synchronization points (termed "miss-sync") in the Parquet reader. Please hold off on merging this PR draft. The plan is to split it into smaller PRs for the actual merge.

TL;DR: This is the performance gain in scalability :rocket: once the future small PRs from this draft are merged:

[scalability benchmark plot]

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [ ] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

JigaoLuo avatar May 26 '25 22:05 JigaoLuo


Outcome

With the current code in this PR draft, Parquet reading performs no memcpy through pageable memory. Here is the outcome of this PR draft, comparing the number of miss-syncs against the latest branch-25.08:

Comparison via the Expert System view in Nsys:

In this PR draft:

$ nsys analyze -r cuda_memcpy_async "/tmp/no-misssync.sqlite"
Processing [/tmp/no-misssync.sqlite] with [/opt/nsight-systems-20250101/host-linux-x64/rules/cuda_memcpy_async.py]... 

 ** CUDA Async Memcpy with Pageable Memory (cuda_memcpy_async):

There were no problems detected related to memcpy operations using pageable
memory.

In the latest branch-25.08:

$ nsys analyze -r cuda_memcpy_async "/tmp/branch2508.sqlite"
Processing [/tmp/branch2508.sqlite] with [/opt/nsight-systems-20250101/host-linux-x64/rules/cuda_memcpy_async.py]... 

 ** CUDA Async Memcpy with Pageable Memory (cuda_memcpy_async):

The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy
operations to block and be executed synchronously. This leads to low GPU
utilization.

Suggestion: If applicable, use PINNED memory instead.

 Duration (ns)    Start (ns)    Src Kind  Dst Kind  Bytes (MB)     PID     Device ID  Context ID  Green Context ID  Stream ID           API Name         
 -------------  --------------  --------  --------  ----------  ---------  ---------  ----------  ----------------  ---------  --------------------------
         6,432   9,395,999,168  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         3,456   9,080,770,580  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         3,232   9,679,481,106  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         3,168   9,081,092,563  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,592   9,084,669,960  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,432   9,082,044,304  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,432   9,084,155,818  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,432   9,084,374,089  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,400   9,084,414,249  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,400   9,084,450,409  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,400   9,084,542,632  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,400   9,084,602,664  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,400   9,084,908,807  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,400   9,956,004,698  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,368   9,081,670,354  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,368   9,084,627,336  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,368   9,085,019,847  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,368   9,955,974,299  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,272   9,365,887,457  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,272   9,366,435,775  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,240   9,679,508,786  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,208   9,956,095,418  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,112   9,365,906,657  Device    Pageable       0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         2,080   9,366,408,511  Device    Pageable       0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         2,080   9,395,470,882  Pageable  Device         0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         1,888   9,391,287,568  Pageable  Device         0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         1,088   9,395,902,593  Pageable  Device         0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         1,088   9,400,020,212  Pageable  Device         0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         1,056  10,200,105,651  Pageable  Device         0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         1,024  10,200,112,371  Pageable  Device         0.000  1,521,787          0           1                           13  cudaMemcpyAsync_ptsz_v7000
         1,024  10,366,822,916  Pageable  Device         0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000
         1,024  10,366,834,276  Pageable  Device         0.000  1,521,787          0           1                           14  cudaMemcpyAsync_ptsz_v7000

The workload simply reads TPC-H SF10 lineitem with 2 threads: nsys profile ./parquet_io_multithreaded lineitem-sf10.parquet 1 FILEPATH 1 2
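For context (my own illustration, not code from this PR): cudaMemcpyAsync only overlaps with other work when the host buffer is pinned; with pageable host memory the driver must stage the copy internally and the call behaves synchronously, which is exactly what the expert-system rule above flags.

```cuda
// Illustrative sketch: pageable vs pinned host buffers with cudaMemcpyAsync.
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
  constexpr size_t n = 1 << 20;
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int* d;
  cudaMalloc(&d, n * sizeof(int));

  // Pageable host memory: the driver stages the transfer through an internal
  // buffer, so the "async" call effectively blocks (the miss-sync pattern).
  int* h_pageable = static_cast<int*>(std::malloc(n * sizeof(int)));
  cudaMemcpyAsync(d, h_pageable, n * sizeof(int), cudaMemcpyHostToDevice, stream);

  // Pinned host memory: a true asynchronous DMA that can overlap with
  // kernels and with copies on other streams.
  int* h_pinned;
  cudaMallocHost(&h_pinned, n * sizeof(int));
  cudaMemcpyAsync(d, h_pinned, n * sizeof(int), cudaMemcpyHostToDevice, stream);

  cudaStreamSynchronize(stream);
  cudaFreeHost(h_pinned);
  std::free(h_pageable);
  cudaFree(d);
  cudaStreamDestroy(stream);
}
```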


Goal & Fix: No miss-sync

The core goal of this PR draft is to eliminate pageable memcpys and miss-syncs in the Parquet reading path, enabling seamless multi-streaming and pipelining without synchronization bottlenecks. While my initial private implementation focused on pure in-place fixes for performance only, this PR contains refactored code. It modifies underlying header files (e.g., cpp/include/cudf/detail/utilities/vector_factories.hpp and cpp/include/cudf/detail/utilities/cuda_memcpy.hpp) to address the issues at their root, rather than applying isolated fixes within the Parquet codebase. I acknowledge that this refactored style may complicate code review and require enhanced test coverage, which I’m prepared to address based on feedback. Please advise whether you prefer the current refactored approach or a more localized in-place fix style.

The changes also raise considerations for unit testing. Since the in-place fix style alters only internal Parquet functions, it might require minimal new testing, whereas the refactored approach, by modifying shared headers such as vector_factories.hpp, would require additional test cases. I will defer detailed unit tests until the code-style discussion is finalized.
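To illustrate the direction of the refactor (a sketch of my understanding, not the exact diff; the function and buffer names here are mine): reading a single device value into pageable host memory forces a blocking copy, while routing it through a pinned staging ("bounce") buffer keeps the copy stream-ordered and only synchronizes the one stream involved.

```cuda
// Hypothetical sketch of the bounce-buffer pattern.
#include <cuda_runtime.h>

// Copy one device element to the host through a pre-allocated pinned
// bounce buffer instead of a pageable stack variable.
template <typename T>
T read_element_via_bounce_buffer(T const* d_ptr, T* pinned_bounce, cudaStream_t stream)
{
  // Async D2H into pinned memory: does not serialize other streams.
  cudaMemcpyAsync(pinned_bounce, d_ptr, sizeof(T), cudaMemcpyDeviceToHost, stream);
  // Only this stream is synchronized before the host reads the value.
  cudaStreamSynchronize(stream);
  return *pinned_bounce;
}
```

This is, as far as I understand, the same idea cudf::detail::device_scalar applies on top of rmm::device_scalar; the PR extends it to the shared utilities mentioned above.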


Takeaway

You can read the long version of the takeaway here: https://github.com/rapidsai/cudf/issues/18967

As a general takeaway for fixing miss-syncs:

  • Avoid thrust::reduce and similar Thrust functions that return results to the host; use CUB primitives instead.
  • Replace rmm::device_scalar with cudf::detail::device_scalar, whose pinned bounce buffer enables efficient data copying.
  • Eliminate rmm::device_uvector::element(...) calls, as element() copies through pageable memory (a limitation best resolved within RMM itself).
  • (I will update this list as further insights emerge during the review process.)
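As a sketch of the first bullet (my own example, not code from this PR): thrust::reduce returns its result by value on the host, so it must synchronize the stream and copy through pageable memory, whereas cub::DeviceReduce writes the result to device memory and stays asynchronous.

```cuda
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

void sum_example(int const* d_in, int* d_out, int n, cudaStream_t stream)
{
  // Miss-sync variant: the host return value forces a stream sync plus a
  // device-to-pageable copy.
  //   int sum = thrust::reduce(thrust::cuda::par.on(stream), d_in, d_in + n);

  // Async alternative: CUB writes the sum to d_out on the device.
  void* d_temp     = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);  // size query
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);  // actual reduce
  cudaFree(d_temp);  // note: cudaFree itself synchronizes; use a pool allocator in practice
}
```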

Let me know your thoughts on the approach, and I’ll adjust the PR accordingly.

JigaoLuo avatar May 26 '25 22:05 JigaoLuo

Here are the remaining technical questions from my side (I left them here instead of in the code review):

  1. When replacing rmm::device_scalar with cudf::detail::device_scalar only in cpp/include/cudf/reduction/detail/reduction.cuh, the code compiles but fails at link time with undefined symbol errors:
/home/jluo/cudf-dev/cpp/examples/parquet_io/build/./parquet_io_multithreaded: symbol lookup error: /home/jluo/miniconda3/envs/cudf_dev/lib/libcudf.so: undefined symbol: _ZN4cudf16timestamp_scalarIN4cuda3std3__46chrono10time_pointINS4_12system_clockENS4_8durationIlNS3_5ratioILl1ELl1000EEEEEEEEC1INS_6detail13device_scalarISB_EEEERKT_bN3rmm16cuda_stream_viewENS1_2mr3__418basic_resource_refILNSN_10_AllocTypeE1EJNSN_17device_accessibleEEEE

[Update & Self-Note]: This indicates that a partial replacement is insufficient: a comprehensive replacement of rmm::device_scalar with cudf::detail::device_scalar is necessary. For example, changing just one line in cpp/include/cudf/reduction/detail/reduction.cuh led to linker errors involving timestamp_scalar and chrono_scalar.

[Update Again]: We decided to push this improvement down into RMM: https://github.com/rapidsai/rmm/pull/1996. Thus, we can ignore this ugly symbol lookup error for now.

  2. [Tracked in #18980] A "nice-looking" modification to hostdevice_vector::element makes the compiler segfault during compilation.
Details folded

The only working approach I’ve found so far is: https://github.com/rapidsai/cudf/pull/18968/files#diff-a97b18b281b557fd56b44267cd57cda53be25f7073a6a86b34c083d88658ac64R117-R126

The non-working "nice-looking" approach I initially considered was:

  [[nodiscard]] T element(std::size_t element_index, rmm::cuda_stream_view stream) const
  {
    // copy d_data.element[element_index] via host pinned memory
    return make_pinned_vector<T>(cudf::device_span<T>{d_data.element_ptr(element_index), 1}, stream).front();
  }

JigaoLuo avatar May 26 '25 22:05 JigaoLuo

Hi @JigaoLuo, thank you for this patch! I'm working my way through the changes.

Have you seen any performance impact from these changes?

vuule avatar May 27 '25 23:05 vuule

A "nice-looking" modification to hostdevice_vector::element causes a segmentation fault during compilation. The only working approach I’ve found so far is: https://github.com/rapidsai/cudf/pull/18968/files#diff-a97b18b281b557fd56b44267cd57cda53be25f7073a6a86b34c083d88658ac64R117-R126

@JigaoLuo Thanks for raising this. I think I have also seen this compiler segfault when there's a cudf::make_host_vector(device_span()) in code. ~Is it possible for you to open a separate issue with an isolated diff that would cause it?~

Created issue #18980

mhaseeb123 avatar May 28 '25 01:05 mhaseeb123

Hi @vuule @mhaseeb123 @wence- @GregoryKimball, thanks for the review and discussion! I think some points still need discussion; once we finalize the details, I’ll address the feedback in separate PRs. I’ll definitely make time to merge this in my free time.

~~I’ll also try to explore performance optimizations in benchmarks.~~ [Update: you can find the benchmark in the following message]

JigaoLuo avatar May 28 '25 10:05 JigaoLuo

Hi @vuule @mhaseeb123 @wence- @GregoryKimball, I’ve run preliminary benchmarks, and as expected, the PR draft shows significant :rocket: gains in scenarios with frequent synchronization stalls ("miss-sync"). This is particularly evident when each thread reads multiple small, concurrent file segments, for example when each thread reads different row groups from a large Parquet file. If this use case convinces you, then we are one step closer to a speed-of-light Parquet reader.

I plotted my benchmark in the style of Greg's comment https://github.com/rapidsai/cudf/issues/15620#issuecomment-2159045929. The PR draft is fully compilable, so feel free to test it when you have time.

[benchmark plot]

You can also find my simple command and details here:

How many miss-syncs are saved?

You will see where this speedup comes from and how nasty miss-syncs can be:

Upstream

$ nsys export --output report_upstream_8threads.sqlite --type sqlite report_upstream_8threads.nsys-rep
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
166861

This PR

$ nsys analyze -r cuda_memcpy_async:rows=-1 report_pr_8threads.sqlite | wc -l
301

All of those remaining miss-syncs come from the Parquet writer; with this patch, no pageable memcpy remains in the Parquet reader.

Command

With num_iterations=10, each thread reads 128 MB ten times, to mimic the use case and to create more miss-syncs.

for t in 1 2 4 8 16 32 64 128; 
do 
  ./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t --axis num_iterations=10 --csv <PATH>;
done

I use the same command on this PR draft and on the upstream branch-25.08 to generate two sets of CSV files.

Hardware setup

RMM memory resource = pool
CUIO host memory resource = pinned_pool
# Devices

## [0] `NVIDIA A100-SXM4-40GB`
* SM Version: 800 (PTX Version: 800)
* Number of SMs: 108
* SM Default Clock Rate: 1410 MHz
* Global Memory: 19704 MiB Free / 40339 MiB Total
* Global Memory Bus Peak: 1555 GB/sec (5120-bit DDR @1215MHz)
* Max Shared Memory: 164 KiB/SM, 48 KiB/Block
* L2 Cache Size: 40960 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: Yes

JigaoLuo avatar May 29 '25 16:05 JigaoLuo

(I realized some of my code review comments weren’t successfully submitted—sorry for the confusion.)

JigaoLuo avatar Aug 02 '25 19:08 JigaoLuo

Note: There may be a bug in this draft or elsewhere in the codebase. I ran millions of read operations over a 10-hour period, purely reading, and encountered a single instance of incorrect results, an estimated trigger rate of just 0.0001%. I cannot be sure where the bug is: in my draft or in the cudf code (I am still using branch-25.08).

I’m leaving this note here as a reminder for myself; we can consider adding more checks while merging the rest of this draft.

[Self-note update]: I’ve observed the same bug even without this patch. The reproduction now seems tied to my metadata caching PR. I’ll allocate time to reproduce it more reliably.

JigaoLuo avatar Sep 21 '25 10:09 JigaoLuo