Copy engine vs SM mode

Open MARD1NO opened this issue 8 months ago • 2 comments

On H20, I find that a copy done in SM mode (for example, a hand-written copy kernel) reaches higher bandwidth than the copy engine (cudaMemcpy): 3.3 TB/s vs. 2 TB/s. Is this reasonable?
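For reference, here is a minimal sketch of how such a comparison could be timed with CUDA events. It is not the code behind the numbers above. The kernel `copyKernel`, the helper `gbPerSec`, the 1 GiB buffer size, the iteration count, and the launch configuration are all assumptions chosen for illustration. The code also cannot confirm whether the runtime actually services the device-to-device `cudaMemcpyAsync` with a copy engine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical SM-mode copy: each thread moves 16-byte chunks in a grid-stride loop.
__global__ void copyKernel(const uint4* __restrict__ src, uint4* __restrict__ dst, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

// Bytes copied per second, in GB/s (1e9 bytes). Counts each byte once,
// not once for the read and once for the write.
double gbPerSec(size_t bytes, int iters, float ms) {
    return (double)bytes * iters / (ms / 1e3) / 1e9;
}

#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e = (call);                                          \
        if (e != cudaSuccess) {                                          \
            fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e));   \
            return 1;                                                    \
        }                                                                \
    } while (0)

int main() {
    const size_t bytes = size_t(1) << 30;      // assumed buffer size: 1 GiB
    const size_t n = bytes / sizeof(uint4);
    const int iters = 20;                      // assumed repetition count

    uint4 *src = nullptr, *dst = nullptr;
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaMemset(src, 1, bytes));

    int sms = 0;
    CHECK(cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0));
    const int threads = 256, blocks = sms * 8; // assumed launch configuration

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    float ms = 0.0f;

    // Path 1: device-to-device cudaMemcpyAsync on the default stream.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));  // warm-up
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, 0));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("cudaMemcpy D2D: %.1f GB/s copied\n", gbPerSec(bytes, iters, ms));

    // Path 2: the hand-written copy kernel running on the SMs.
    copyKernel<<<blocks, threads>>>(src, dst, n);                  // warm-up
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        copyKernel<<<blocks, threads>>>(src, dst, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("copy kernel:    %.1f GB/s copied\n", gbPerSec(bytes, iters, ms));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return 0;
}
```

Both paths here report bytes copied per second. If one figure counts read plus write traffic and the other counts only bytes copied, the first will appear twice as large for the same elapsed time, so the two numbers are only comparable under the same convention.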

May 09 '25 07:05 MARD1NO

The CUDA Samples are intended for illustration and functional use, but are NOT intended for serious performance testing.

We'd recommend you use the NVBandwidth code to do performance testing.

May 27 '25 18:05 jnbntz

I observed this in my Nsight Systems profile: the elapsed time of the CUDA D2D memcpy is much higher than that of my handwritten kernels. Is this reasonable? All I know is that cudaMemcpy does not use any SM resources.

May 28 '25 02:05 MARD1NO