te42kyfo

Results 15 comments of te42kyfo

I observed the same: on MI100, (TCC_HIT_sum + TCC_MISS_sum) * 32 matched the expected L2 cache data volume. On MI210, this expression results in exactly half of what is expected....
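
To make the counter arithmetic in this comment concrete, here is a minimal C++ sketch. The function name `l2_volume_bytes` and the `bytes_per_request` parameter are hypothetical and not part of any profiler API. The factor 32 is the one reported to match on MI100; passing 64 for MI210 is only an inference from the "exactly half" observation, not something the comment confirms.

```cpp
#include <cstdint>

// Hypothetical helper: estimate the L2 data volume in bytes from the
// summed TCC hit and miss counters.
// bytes_per_request = 32 is the factor that matched on MI100 according
// to the comment. A value of 64 for MI210 is an assumption derived from
// the "exactly half" observation.
constexpr std::uint64_t l2_volume_bytes(std::uint64_t tcc_hit_sum,
                                        std::uint64_t tcc_miss_sum,
                                        std::uint64_t bytes_per_request = 32) {
    return (tcc_hit_sum + tcc_miss_sum) * bytes_per_request;
}
```

With the same counter values, doubling `bytes_per_request` doubles the estimate, which is what would close a gap of exactly a factor of two.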

At 1024*1024 elements, the total data volume per array is 1024*1024 * sizeof(double) = 8MB. The 4080Ti has 32MB of L2 cache, so even the triad test, which uses 3...
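
The size arithmetic above can be checked with a short C++ sketch. The helper names are hypothetical. The element count, the three arrays of the triad test, and the 32MB L2 size are the figures quoted in the comment, and "MB" is taken to mean 1024*1024 bytes, which is what makes the per-array size come out at exactly 8MB.

```cpp
#include <cstddef>

// 1 MB as used in the comment: 1024 * 1024 bytes.
constexpr std::size_t kMB = 1024 * 1024;

// Bytes occupied by one array of doubles.
constexpr std::size_t array_bytes(std::size_t elements) {
    return elements * sizeof(double);
}

// Total bytes touched by a kernel that streams over `arrays` arrays.
constexpr std::size_t working_set_bytes(std::size_t arrays,
                                        std::size_t elements) {
    return arrays * array_bytes(elements);
}

// True if the whole working set is no larger than the given cache size.
constexpr bool fits_in_cache(std::size_t arrays, std::size_t elements,
                             std::size_t cache_bytes) {
    return working_set_bytes(arrays, elements) <= cache_bytes;
}
```

For three arrays of 1024*1024 doubles, `working_set_bytes` gives 24MB, which is below the quoted 32MB L2 size.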

As far as I could google, the A800 is made from two GA100 chips, each of which has 40MB of cache. You should also be able to query that as...

You are right, my info was faulty. The A800 is just one model based on the GA100 chip, which has 40MB L2. I just googled really quickly because I haven't...

First of all, if you want to see more data, you can uncomment the lines 81-91:

```
measureDRAMBytesStart();
callKernel(blockCount, blockRun);
auto metrics = measureDRAMBytesStop();
dram_read.add(metrics[0]);
dram_write.add(metrics[1]);
measureL2BytesStart();
callKernel(blockCount, blockRun);
metrics...
```

That's a fun one! `localSum += B[idx];` results in assembly like this ([godbolt](https://godbolt.org/z/P8qKE4bsa), N=4 to make it shorter):

```
LDG.E.64.CONSTANT R6, [R4.64]
LDG.E.64.CONSTANT R10, [R8.64]
LDG.E.64.CONSTANT R14, [R12.64]
LDG.E.64.CONSTANT R16, [R16.64]...
```

Thank you for your feedback. I have so far only tried gfx90a and gfx1030 targets, as these are the ones I have available. The error is somewhere in the performance...

I tested this on a machine with an RX6900XT. When I use your build command line, it fails for me with the same error. If I use the one from...

Can you please verify that this actually fixes your problem? Also, as I have said, I would be interested in your results.

I have written cuda-L2-cache specifically to benchmark only the L2 cache bandwidth. It simulates a scenario where data is being read repeatedly from thread blocks on SMs all over the...