gpu-benches
Why is the blocksize 256 in the gpu-cache test?
Hey, I see that the blocksize in the gpu-cache test is 256. Why is it not 1024?
When I changed the blocksize from 256 to 1024, the measured L1 cache bandwidth improved somewhat, but it also fluctuates more.
Results with blocksize = 256:
data set exec time spread Eff. bw DRAM read DRAM write L2 read L2 store
1 kB 50ms 0.7% 8648.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
2 kB 37ms 0.1% 11608.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
3 kB 33ms 0.0% 12947.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
4 kB 31ms 5.4% 14061.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
6 kB 30ms 3.3% 14402.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
8 kB 30ms 6.6% 14989.1 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
10 kB 30ms 3.0% 14555.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
12 kB 30ms 27.9% 15976.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
14 kB 30ms 5.3% 14430.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
16 kB 30ms 2.2% 14588.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
18 kB 33ms 2.0% 13113.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
20 kB 30ms 17.5% 15206.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
22 kB 29ms 7.9% 15610.4 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
24 kB 28ms 11.8% 15916.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
26 kB 32ms 11.1% 13737.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
28 kB 30ms 5.0% 14240.1 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
30 kB 31ms 0.6% 14172.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
32 kB 30ms 4.1% 14733.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
34 kB 29ms 2.2% 14845.4 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
36 kB 29ms 3.3% 15113.0 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
38 kB 29ms 5.4% 14967.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
40 kB 29ms 5.4% 15129.5 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
42 kB 29ms 8.7% 15437.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
44 kB 29ms 7.0% 15451.0 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
46 kB 29ms 8.4% 15633.8 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
48 kB 28ms 12.3% 15940.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
50 kB 28ms 16.4% 16288.1 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
52 kB 28ms 14.6% 16230.0 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
54 kB 28ms 12.6% 16195.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
56 kB 27ms 10.0% 16434.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
58 kB 28ms 11.0% 16433.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
Results with blocksize = 1024:
data set exec time spread Eff. bw DRAM read DRAM write L2 read L2 store
4 kB 37ms 0.1% 11645.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
6 kB 111ms 0.0% 3902.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
8 kB 29ms 46.0% 17593.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
10 kB 66ms 6.0% 6564.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
12 kB 29ms 24.8% 16609.0 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
14 kB 52ms 1.4% 8303.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
16 kB 28ms 27.1% 17275.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
18 kB 44ms 6.6% 9894.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
20 kB 28ms 27.0% 17521.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
22 kB 39ms 7.5% 11307.5 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
24 kB 27ms 16.9% 17184.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
26 kB 37ms 18.0% 12475.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
28 kB 27ms 40.3% 18542.5 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
30 kB 34ms 11.9% 13365.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
32 kB 26ms 20.7% 18043.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
34 kB 34ms 23.1% 14124.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
36 kB 27ms 26.9% 17707.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
My device is an A800 80GB PCIe.
The thread block size needs to be a divisor of N, which is the template parameter to measure<N>. Otherwise, many threads will do too much work.
From line 144 onward, use only multiples of 1024 as the template parameter. On GPUs that do not have as large an L1 cache, the amount of work per thread would then be very small, and the performance would actually be worse.
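To illustrate the divisibility point, here is a rough sketch of the arithmetic. It is a simplified model, not the benchmark's actual kernel: it assumes every thread in a block runs the same strided loop over N elements, so the trip count gets rounded up, and that the reported bandwidth is based on N. The helper names `itersPerThread` and `elementsTouched` and the element counts are made up for the example.

```cpp
#include <cstddef>

// Hypothetical model (assumption): each thread loops over n elements with a
// stride of blockSize, so the number of iterations is ceil(n / blockSize).
constexpr std::size_t itersPerThread(std::size_t n, std::size_t blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Elements the whole block actually touches under that model.
constexpr std::size_t elementsTouched(std::size_t n, std::size_t blockSize) {
    return itersPerThread(n, blockSize) * blockSize;
}

// 256 divides 1536: the block touches exactly n elements.
static_assert(elementsTouched(1536, 256) == 1536, "no extra work");
// 1024 does not divide 1536: the block touches 2 * 1024 = 2048 elements,
// a third more than the 1536 the bandwidth would be computed from.
static_assert(elementsTouched(1536, 1024) == 2048, "extra work");
// 2048 is a multiple of 1024, so work and accounting match again.
static_assert(elementsTouched(2048, 1024) == 2048, "no extra work");
```

Under these assumptions, a size that is not a multiple of the block size takes longer than its accounted data volume suggests, which would fit the alternating fast and slow rows in the blocksize = 1024 output.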