[PERF][BUG]: Thrust uses cudaMemcpy for Device->Device copies (66% SoL on H200)
### Is this a duplicate?
- [X] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
### Type of Bug
Performance
### Component
Thrust
### Describe the bug
thrust::copy uses cudaMemcpy to implement device-to-device copies, which reaches at most ~66% of peak memory bandwidth (SoL) on H200.
nvbug 4207603
### How to Reproduce
See the godbolt link for the exact reproducer.
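For orientation, here is a rough sketch of what the reproducer does (the godbolt link below has the exact code; the buffer size is an assumption chosen to match the ~8.59 GB of traffic in the output, and error checking is omitted):

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main() {
  // Assumed size: ~4.3 GB per buffer, i.e. ~8.59 GB of total traffic per copy.
  const std::size_t n = 1ull << 29;  // uint64_t elements
  thrust::device_vector<unsigned long long> src(n, 1);
  thrust::device_vector<unsigned long long> dst(n);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (int rep = 0; rep < 3; ++rep) {
    cudaEventRecord(start);
    // For this contiguous, trivially copyable device-to-device case,
    // thrust::copy dispatches to cudaMemcpyAsync internally.
    thrust::copy(src.begin(), src.end(), dst.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gb = 2.0 * n * sizeof(unsigned long long) / 1e9;  // read + write
    std::printf("%.2f GB  %.4f ms  %.1f GB/s\n", gb, ms, gb / ms * 1e3);
  }
}
```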
Observed output:

```
$ ./01_thrust_copy
cp_gb  elapsed_ms  bw_gbps  pct_of_sol
 8.59      2.6090   3292.4       67.0%
 8.59      2.6073   3294.5       67.0%
 8.59      2.6061   3296.1       67.0%
```
### Expected behavior
thrust::copy should be able to saturate memory bandwidth.
### Reproduction link
https://godbolt.org/z/foPG4ox53
### Operating System
No response
### nvidia-smi output

```
$ nvidia-smi
Mon Apr 29 05:40:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H200 On | 00000000:45:00.0 Off | 0 |
| N/A 27C P0 73W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
### NVCC version
NA
Related issue in RAPIDS, where smaller copies are serialized behind larger copies due to busy copy engines.
@gevtushenko: does thrust::copy_n use a kernel to perform the copying? Perhaps that should be used instead.
I just ran the Thrust benchmark for copy on my A6000 and the current, cudaMemcpy-based implementation performs "well enough":
| T{ct} | Elements | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|---|
| U8 | 2^28 = 268435456 | 558x | 782.671 us | 11.17% | 773.783 us | 8.30% | 346.913G | 693.826 GB/s | 90.33% |
| U16 | 2^28 = 268435456 | 608x | 1.522 ms | 2.59% | 1.518 ms | 2.36% | 176.841G | 707.363 GB/s | 92.09% |
| U32 | 2^28 = 268435456 | 820x | 3.050 ms | 4.37% | 3.045 ms | 4.18% | 88.167G | 705.332 GB/s | 91.83% |
| U64 | 2^28 = 268435456 | 854x | 6.065 ms | 2.53% | 6.061 ms | 2.52% | 44.288G | 708.610 GB/s | 92.26% |
| NonTrivial | 2^28 = 268435456 | 1200x | 6.168 ms | 3.17% | 6.163 ms | 3.16% | 43.554G | 696.871 GB/s | 90.73% |
However, I saw the <66% SoL on H200 myself a few days ago, so a kernel is probably the better choice there. Are there any upsides to using cudaMemcpy? I could imagine that if the device is busy with other work, using the copy engines results in less contention for SMs and better overall application throughput. I am therefore wondering whether we need to give users a knob to choose which copy implementation is used.
Given that the performance on the A6000 looks fine, we may also want to dispatch between cudaMemcpy and a kernel depending on the GPU we are running on.
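For reference, the kernel alternative being discussed is essentially a grid-stride copy. A minimal sketch follows (an illustration only, not what Thrust would ship; a production version would presumably add vectorized accesses and launch-configuration tuning):

```cpp
#include <cstddef>

// Plain grid-stride copy: each thread strides over the buffer, so any
// launch size covers any problem size. 64-bit indexing avoids overflow
// on multi-GiB buffers.
__global__ void copy_kernel(const unsigned long long* __restrict__ in,
                            unsigned long long* __restrict__ out,
                            std::size_t n) {
  const std::size_t stride = static_cast<std::size_t>(gridDim.x) * blockDim.x;
  for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    out[i] = in[i];
  }
}
```

Launched with enough blocks to fill the device, this kind of kernel can typically get close to SoL for large copies.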
> @gevtushenko: does thrust::copy_n use a kernel to perform the copying? Perhaps that should be used instead.
thrust::copy_n uses the same implementation as thrust::copy and will also use cudaMemcpyAsync when possible, and otherwise thrust::transform.
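A consequence of that dispatch (assuming it keys on contiguous iterators over trivially copyable types) is that wrapping the source in a fancy iterator forces the kernel path. A sketch of that as a possible stopgap, not a documented API:

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>

struct identity_op {
  __host__ __device__ unsigned int operator()(unsigned int x) const { return x; }
};

int main() {
  thrust::device_vector<unsigned int> src(1 << 20, 42u);
  thrust::device_vector<unsigned int> dst(1 << 20);

  // Raw contiguous range of a trivially copyable type: takes the
  // cudaMemcpyAsync path described above.
  thrust::copy(src.begin(), src.end(), dst.begin());

  // The same copy through a transform_iterator is no longer a trivial
  // memcpy from Thrust's point of view, so it should fall back to the
  // kernel (thrust::transform) path instead.
  auto first = thrust::make_transform_iterator(src.begin(), identity_op{});
  thrust::copy(first, first + src.size(), dst.begin());
}
```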
I discussed this with @gevtushenko yesterday, and he remembers a time when we actually had a custom kernel for thrust::copy but switched to cudaMemcpy because the latter was faster. We want to avoid ping-ponging between a custom kernel and cudaMemcpy ourselves, and would rather let the team behind cudaMemcpy handle this, for which you already opened a bug report.
This does not mean we couldn't still make an exception, but rather that we are trying to address more pressing issues first, e.g. #1673, and see how cudaMemcpy develops.
This is fixed and should be visible in a future public release. Please see nvbug 4207603.
Great! I will close the issue then, since no further action is necessary from our side. Feel free to reopen it if the problem is not resolved once the fixes land!