te42kyfo
Your results look absolutely in line with what I had measured myself before. Regarding the very high numbers in the beginning: Initially, for the first few dataset sizes, there is...
The parameters used (256 kB) were fine before, but don't work as well with the larger L1 cache of the H100. The cache line replacement strategy might also have changed.
This is not due to the read-write ratio, but to the amount of memory parallelism. On the H200, the memory interface is so wide that even at full occupancy,...
A single thread would work just the same. Using a full warp just feels better, and a full warp is 64 threads on CDNA hardware. On NVIDIA, this actually runs...
The number of thread blocks needs to be a divisor of N, which is a template parameter of the measurement. Otherwise, many threads will do too much work. In lines 144...