[PERF][BUG]: Thrust uses cudaMemcpy for Device->Device copies (66% SoL on H200)
### Is this a duplicate?
- [X] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
### Type of Bug
Performance
### Component
Thrust
### Describe the bug
thrust::copy uses cudaMemcpy to implement device-to-device copies, which reaches at most ~66% of peak memory bandwidth (SoL) on H200.
nvbug 4207603
### How to Reproduce
See the godbolt link for the exact reproducer.
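For orientation, here is a rough sketch of what the reproducer does (the godbolt link below has the exact code; the buffer size is an assumption chosen to match the ~8.59 GB of traffic in the output, and error checking is omitted):

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main() {
  // Assumed size: ~4.3 GB per buffer, i.e. ~8.59 GB of total traffic per copy.
  const std::size_t n = 1ull << 29;  // uint64_t elements
  thrust::device_vector<unsigned long long> src(n, 1);
  thrust::device_vector<unsigned long long> dst(n);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (int rep = 0; rep < 3; ++rep) {
    cudaEventRecord(start);
    // For this contiguous, trivially copyable device-to-device case,
    // thrust::copy dispatches to cudaMemcpyAsync internally.
    thrust::copy(src.begin(), src.end(), dst.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gb = 2.0 * n * sizeof(unsigned long long) / 1e9;  // read + write
    std::printf("%.2f GB  %.4f ms  %.1f GB/s\n", gb, ms, gb / ms * 1e3);
  }
}
```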
Observed output:

```
$ ./01_thrust_copy
cp_gb  elapsed_ms  bw_gbps  pct_of_sol
 8.59      2.6090   3292.4       67.0%
 8.59      2.6073   3294.5       67.0%
 8.59      2.6061   3296.1       67.0%
```
### Expected behavior
thrust::copy should be able to saturate memory bandwidth.
### Reproduction link
https://godbolt.org/z/foPG4ox53
### Operating System
No response
### nvidia-smi output

```
$ nvidia-smi
Mon Apr 29 05:40:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H200 On | 00000000:45:00.0 Off | 0 |
| N/A 27C P0 73W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
### NVCC version
NA
Related issue in RAPIDS, where smaller copies are serialized behind larger copies due to busy copy engines.
@gevtushenko: does thrust::copy_n use a kernel to perform the copying? Perhaps that should be used instead.
I just ran the Thrust benchmark for copy on my A6000 and the current, cudaMemcpy-based implementation performs "well enough":
| T{ct} | Elements | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|---|
| U8 | 2^28 = 268435456 | 558x | 782.671 us | 11.17% | 773.783 us | 8.30% | 346.913G | 693.826 GB/s | 90.33% |
| U16 | 2^28 = 268435456 | 608x | 1.522 ms | 2.59% | 1.518 ms | 2.36% | 176.841G | 707.363 GB/s | 92.09% |
| U32 | 2^28 = 268435456 | 820x | 3.050 ms | 4.37% | 3.045 ms | 4.18% | 88.167G | 705.332 GB/s | 91.83% |
| U64 | 2^28 = 268435456 | 854x | 6.065 ms | 2.53% | 6.061 ms | 2.52% | 44.288G | 708.610 GB/s | 92.26% |
| NonTrivial | 2^28 = 268435456 | 1200x | 6.168 ms | 3.17% | 6.163 ms | 3.16% | 43.554G | 696.871 GB/s | 90.73% |
However, I saw the <66% SoL on H200 myself a few days ago, so a kernel is probably the better choice there. Are there any upsides to using cudaMemcpy? I could imagine that if the device is busy with other work, using the copy engines results in less contention for SMs and better overall application throughput. I am therefore wondering whether we need to give users a knob to choose which copy implementation is used.
Given that the performance on the A6000 looks fine, we may also want to dispatch between cudaMemcpy and a kernel depending on the GPU we are running on.
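For reference, the kernel alternative being discussed is essentially a grid-stride copy. A minimal sketch follows (an illustration only, not what Thrust would ship; a production version would presumably add vectorized accesses and launch-configuration tuning):

```cpp
#include <cstddef>

// Plain grid-stride copy: each thread strides over the buffer, so any
// launch size covers any problem size. 64-bit indexing avoids overflow
// on multi-GiB buffers.
__global__ void copy_kernel(const unsigned long long* __restrict__ in,
                            unsigned long long* __restrict__ out,
                            std::size_t n) {
  const std::size_t stride = static_cast<std::size_t>(gridDim.x) * blockDim.x;
  for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    out[i] = in[i];
  }
}
```

Launched with enough blocks to fill the device, this kind of kernel can typically get close to SoL for large copies.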
> @gevtushenko: does thrust::copy_n use a kernel to perform the copying? Perhaps that should be used instead.
thrust::copy_n uses the same implementation as thrust::copy and will also use cudaMemcpyAsync when possible, and otherwise thrust::transform.
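A consequence of that dispatch (assuming it keys on contiguous iterators over trivially copyable types) is that wrapping the source in a fancy iterator forces the kernel path. A sketch of that as a possible stopgap, not a documented API:

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>

struct identity_op {
  __host__ __device__ unsigned int operator()(unsigned int x) const { return x; }
};

int main() {
  thrust::device_vector<unsigned int> src(1 << 20, 42u);
  thrust::device_vector<unsigned int> dst(1 << 20);

  // Raw contiguous range of a trivially copyable type: takes the
  // cudaMemcpyAsync path described above.
  thrust::copy(src.begin(), src.end(), dst.begin());

  // The same copy through a transform_iterator is no longer a trivial
  // memcpy from Thrust's point of view, so it should fall back to the
  // kernel (thrust::transform) path instead.
  auto first = thrust::make_transform_iterator(src.begin(), identity_op{});
  thrust::copy(first, first + src.size(), dst.begin());
}
```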
I discussed this with @gevtushenko yesterday, and he remembers a time when we actually had a custom kernel for thrust::copy but switched to cudaMemcpy because the latter was faster. We want to avoid ping-ponging between a custom kernel and cudaMemcpy ourselves, and would rather let the team behind cudaMemcpy handle this, for which you already opened a bug report.
This does not mean we couldn't still make an exception, but rather that we are trying to address more pressing issues first, e.g. #1673, and see how cudaMemcpy develops.
This is fixed and should be visible in a future public release. Please see nvbug 4207603.
Great! I will close the issue then, since no further action is necessary from our side. Feel free to reopen it if the problem is not resolved once the fixes land!