Copy engine vs SM mode
I find that in SM mode (for example, a hand-written copy kernel), memcpy bandwidth is higher than with the copy engine (cudaMemcpy) on H20: 3.3 TB/s vs 2 TB/s. Is this reasonable?
The CUDA Samples are intended for illustration and functional use, but are NOT intended for serious performance testing.
We'd recommend you use the NVBandwidth code to do performance testing.
I just observed this in my Nsight Systems profile: the elapsed time of the CUDA D2D memcpy is much higher than that of my hand-written kernel. Is this reasonable? All I know is that cudaMemcpy does not use any SM resources.
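
For a like-for-like comparison outside the profiler, something along these lines could time both paths with CUDA events. This is only a minimal sketch, not the code from this thread: `copy_kernel`, the 1 GiB buffer size, the iteration count, and the launch configuration are all assumptions and are untuned. It prints each rate both as bytes copied per second and as read-plus-write traffic, because comparing two figures that use different conventions would by itself produce a 2x gap.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical SM copy kernel: grid-stride loop over 16-byte elements.
__global__ void copy_kernel(const uint4* __restrict__ src,
                            uint4* __restrict__ dst, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

// GB/s counting each copied byte once.
double copied_gbps(size_t bytes, float ms) { return (double)bytes / (ms * 1e6); }
// GB/s counting memory traffic (every byte is read once and written once).
double traffic_gbps(size_t bytes, float ms) { return 2.0 * copied_gbps(bytes, ms); }

static void report(const char* name, size_t bytes, float ms) {
    printf("%-12s %8.3f ms  copied %8.1f GB/s  read+write %8.1f GB/s\n",
           name, ms, copied_gbps(bytes, ms), traffic_gbps(bytes, ms));
}

int main() {
    const size_t bytes = (size_t)1 << 30;      // assumed buffer size: 1 GiB
    const size_t n = bytes / sizeof(uint4);
    const int iters = 20;                      // assumed iteration count

    uint4 *src = nullptr, *dst = nullptr;
    CHECK(cudaMalloc((void**)&src, bytes));
    CHECK(cudaMalloc((void**)&dst, bytes));
    CHECK(cudaMemset(src, 1, bytes));

    int sms = 0;
    CHECK(cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0));
    const int threads = 256, blocks = sms * 8; // untuned launch configuration

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    float ms = 0.0f;

    // Path 1: cudaMemcpy-style device-to-device copy.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));  // warm-up
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    report("cudaMemcpy", bytes, ms / iters);

    // Path 2: hand-written copy kernel running on the SMs.
    copy_kernel<<<blocks, threads>>>(src, dst, n);                 // warm-up
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        copy_kernel<<<blocks, threads>>>(src, dst, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    report("copy_kernel", bytes, ms / iters);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return 0;
}
```

Build with `nvcc -O3` and compare the two lines using the same column. If the gap remains with both figures computed the same way, NVBandwidth, as recommended above, is the better tool for confirming it.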