
[PERF][BUG]: Thrust uses cudaMemcpy for Device->Device copies (66% SoL on H200)

Open · ahendriksen opened this issue on Apr 29, 2024

Is this a duplicate?

  • [X] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Performance

Component

Thrust

Describe the bug

thrust::copy implements device-to-device copies via cudaMemcpy, which reaches at most ~66% of peak memory bandwidth (speed of light) on H200. Tracked internally as nvbug 4207603.

How to Reproduce

See the godbolt link below for the exact reproducer.
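For context, a minimal sketch of this kind of device-to-device copy benchmark follows (buffer size, element type, and timing details are assumptions, not the exact godbolt code):

```cpp
// Minimal sketch, not the exact godbolt reproducer: time a device-to-device
// thrust::copy with CUDA events and report achieved bandwidth.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main() {
  // Assumed buffer size: ~4.29 GB per buffer, so read + write traffic is
  // ~8.59 GB, matching the cp_gb column in the output below.
  const std::size_t n = (1ull << 32) / sizeof(int);
  thrust::device_vector<int> src(n, 1);
  thrust::device_vector<int> dst(n);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (int i = 0; i < 3; ++i) {
    cudaEventRecord(start);
    thrust::copy(src.begin(), src.end(), dst.begin());  // currently dispatches to cudaMemcpyAsync
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gb = 2.0 * n * sizeof(int) / 1e9;  // bytes read + bytes written
    std::printf("%10.2f %11.4f %11.1f\n", gb, ms, gb / (ms / 1e3));
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}
```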

Observed output:

$ ./01_thrust_copy 
     cp_gb  elapsed_ms     bw_gbps  pct_of_sol
      8.59      2.6090      3292.4       67.0%
      8.59      2.6073      3294.5       67.0%
      8.59      2.6061      3296.1       67.0%

Expected behavior

thrust::copy should be able to saturate bandwidth.

Reproduction link

https://godbolt.org/z/foPG4ox53

Operating System

No response

nvidia-smi output

$ nvidia-smi 
Mon Apr 29 05:40:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:45:00.0 Off |                    0 |
| N/A   27C    P0             73W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

NA

ahendriksen · Apr 29, 2024

Related issue in RAPIDS, where smaller copies are serialized behind larger copies due to busy copy engines.

@gevtushenko: does thrust::copy_n use a kernel to perform the copy? Perhaps that should be used instead.

ahendriksen · May 2, 2024

I just ran the Thrust benchmark for copy on my A6000 and the current, cudaMemcpy-based implementation performs "well enough":

| T{ct} | Elements | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|-------|----------|---------|----------|-------|----------|-------|--------|--------------|--------|
| U8 | 2^28 = 268435456 | 558x | 782.671 us | 11.17% | 773.783 us | 8.30% | 346.913G | 693.826 GB/s | 90.33% |
| U16 | 2^28 = 268435456 | 608x | 1.522 ms | 2.59% | 1.518 ms | 2.36% | 176.841G | 707.363 GB/s | 92.09% |
| U32 | 2^28 = 268435456 | 820x | 3.050 ms | 4.37% | 3.045 ms | 4.18% | 88.167G | 705.332 GB/s | 91.83% |
| U64 | 2^28 = 268435456 | 854x | 6.065 ms | 2.53% | 6.061 ms | 2.52% | 44.288G | 708.610 GB/s | 92.26% |
| NonTrivial | 2^28 = 268435456 | 1200x | 6.168 ms | 3.17% | 6.163 ms | 3.16% | 43.554G | 696.871 GB/s | 90.73% |
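As a rough sanity check on how to read these numbers (assuming the benchmark counts read plus write traffic and an A6000 peak of ~768 GB/s): for the U8 row, 2 × 268435456 bytes ≈ 0.537 GB moved in 773.783 µs gives ≈ 693.8 GB/s, which is about 90.3% of peak and matches the BWUtil column.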

However, I saw the ~66% SoL number on H200 a few days ago, so on that GPU a kernel is probably the better choice. Are there any upsides to using cudaMemcpy? I could imagine that if the device is busy with other work, using the copy engines results in less contention for SMs and better overall application throughput. I am therefore wondering whether we need to give users a knob to choose which copy implementation is used.

Given that the performance on the A6000 looks fine, we may also want to dispatch between cudaMemcpy and a kernel depending on the GPU we are running on.

> @gevtushenko: does thrust::copy_n use a kernel to perform the copy? Perhaps that should be used instead.

thrust::copy_n uses the same implementation as thrust::copy: it dispatches to cudaMemcpyAsync when possible and falls back to thrust::transform otherwise.
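For anyone who needs full bandwidth on H200 in the meantime, one possible user-side workaround (a sketch under the assumption that the kernel path is faster there, not an official Thrust recommendation) is to express the copy as a transform with thrust::identity, which takes the kernel path rather than the cudaMemcpyAsync one; a per-GPU dispatch inside Thrust could fall back to the same path:

```cpp
// Sketch of a user-side workaround (assumption: the kernel path is faster on
// H200): express the copy as a transform so Thrust launches a kernel instead
// of calling cudaMemcpyAsync.
#include <thrust/device_vector.h>
#include <thrust/functional.h>  // thrust::identity
#include <thrust/transform.h>

template <typename T>
void copy_via_kernel(const thrust::device_vector<T>& src, thrust::device_vector<T>& dst) {
  thrust::transform(src.begin(), src.end(), dst.begin(), thrust::identity<T>{});
}
```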

bernhardmgruber · Jul 15, 2024

I discussed this with @gevtushenko yesterday, and he remembers a time when we actually had a custom kernel for thrust::copy but switched to cudaMemcpy because the latter was faster. We want to avoid ping-ponging between a custom kernel and cudaMemcpy ourselves and would rather let the team behind cudaMemcpy handle this, for which you have already opened a bug report.

This does not mean we could not still make an exception, but we would rather address more pressing issues first, e.g. #1673, and see how cudaMemcpy develops.

bernhardmgruber · Jul 17, 2024

This is fixed; the improvement should be visible in a future public release. Please see nvbug 4207603.

ahendriksen · Jul 19, 2024

Great! I will close the issue then, since no further action is necessary from our side. Feel free to reopen it if the problem is not resolved once the fixes land!

bernhardmgruber · Jul 22, 2024