te42kyfo comments

Results: 15 comments of te42kyfo

What's the difference between cuda-l2-cache and gpu-cache benchmarks?

Your results look absolutely in line with what I had measured myself before. Regarding the very high numbers in the beginning: Initially, for the first few dataset sizes, there is...

What's the difference between cuda-l2-cache and gpu-cache benchmarks?

The parameters used (256 kB) were fine before, but do not work as well with the larger L1 cache of the H100. The cache line (CL) replacement strategy might also have changed.

Read bandwidth not match

This is not due to the read-write ratio, but to the amount of memory parallelism. On the H200, the memory interface is so wide that even at full occupancy,...
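
The memory-parallelism argument can be made concrete with Little's law: sustaining a given bandwidth requires at least bandwidth times latency bytes in flight. Below is a rough sketch of that arithmetic. The bandwidth, latency, and thread-count figures are placeholder assumptions chosen for illustration, not measurements from the benchmark and not vendor specifications, and the helper names are hypothetical.

```python
def bytes_in_flight_needed(bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Little's law: data in flight = throughput * latency."""
    return bandwidth_bytes_per_s * latency_s


def bytes_in_flight_available(threads: int, loads_per_thread: int, bytes_per_load: int) -> int:
    """Upper bound on outstanding bytes if every thread keeps all of its loads in flight."""
    return threads * loads_per_thread * bytes_per_load


# Hypothetical numbers, chosen only to illustrate the calculation.
BANDWIDTH = 4.0e12   # bytes/s (assumed)
LATENCY = 800e-9     # seconds (assumed)
THREADS = 250_000    # resident threads at full occupancy (assumed)
BYTES_PER_LOAD = 8   # one double per load

needed = bytes_in_flight_needed(BANDWIDTH, LATENCY)
for loads in (1, 2, 4):
    available = bytes_in_flight_available(THREADS, loads, BYTES_PER_LOAD)
    ratio = available / needed
    print(f"{loads} independent load(s) per thread: {ratio:.2f}x the required bytes in flight")
```

With these assumed figures, a single outstanding 8-byte load per thread does not cover the required bytes in flight, while two or more independent loads per thread do. That is consistent with the point that occupancy alone may not supply enough memory parallelism on a very wide memory interface.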

Why the block size in gpu-latency is 64

A single thread would work just the same. Using a full warp just feels better, and a full warp is 64 threads on CDNA hardware. On NVIDIA, this actually runs...

Why blocksize is 256 in gpu-cache test

The number of thread blocks needs to be a divisor of N, which is a template parameter to measure. Otherwise many threads will do too much work. In lines 144...