Copy engine vs SM mode

Open MARD1NO opened this issue 8 months ago • 2 comments

I find that in SM mode (for example, a hand-written copy kernel), device-to-device copy bandwidth is higher than with the copy engine (cudaMemcpy) on an H20: 3.3 TB/s vs. 2 TB/s. Is this reasonable?
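For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two paths could be timed side by side. It is not the code behind the numbers above: the buffer size, iteration count, launch configuration, and the names `copy_kernel` and `gbps` are all assumptions chosen for illustration. Both paths count the copied payload once, since a tool that counts reads plus writes would report twice the figure for the same copy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical SM-mode copy: grid-stride loop with 16-byte loads/stores.
__global__ void copy_kernel(const uint4* __restrict__ src,
                            uint4* __restrict__ dst, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += stride)
        dst[i] = src[i];
}

// GB/s for `bytes` of payload moved in `ms` milliseconds (payload counted once).
static double gbps(double bytes, double ms) { return bytes / (ms * 1e-3) / 1e9; }

int main() {
    const size_t bytes = 1ull << 30;  // 1 GiB per copy (assumed size)
    const size_t n = bytes / sizeof(uint4);
    const int iters = 20;             // assumed repeat count

    uint4 *src = nullptr, *dst = nullptr;
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaMemset(src, 1, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    float ms = 0.0f;

    // Path 1: runtime device-to-device copy.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));  // warm-up
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, 0));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("cudaMemcpy D2D: %.1f GB/s\n", gbps((double)bytes * iters, ms));

    // Path 2: hand-written copy kernel running on the SMs.
    int device = 0, sms = 0;
    CHECK(cudaGetDevice(&device));
    CHECK(cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, device));
    const int threads = 256, blocks = sms * 8;  // assumed launch shape
    copy_kernel<<<blocks, threads>>>(src, dst, n);  // warm-up
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        copy_kernel<<<blocks, threads>>>(src, dst, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("copy kernel:    %.1f GB/s\n", gbps((double)bytes * iters, ms));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return 0;
}
```

Build with something like `nvcc -O3 copy_bench.cu -o copy_bench` (the file name is arbitrary). Timing with CUDA events on the same stream keeps host launch overhead out of the measurement as far as possible, but results will still vary with buffer size and launch shape.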

MARD1NO (May 09 '25 07:05)

The CUDA Samples are intended for illustration and functional use, but are NOT intended for serious performance testing.

We'd recommend you use the NVBandwidth code to do performance testing.

jnbntz (May 27 '25 18:05)

I just observed this in my Nsight Systems profile: the elapsed time of the CUDA D2D memcpy is much higher than that of my handwritten kernels. Is this reasonable? All I know is that cudaMemcpy does not use any SM resources.

MARD1NO (May 28 '25 02:05)