gpu-benches: read bandwidth does not match

Jan 14 '25 03:01 yulingding

rootgpu-stream# ./cuda-stream
GPU H200
block smBlocks threads   occ% |        init  read scale triad   3pt   5pt
   16     2112       1   0.8% | GB/s:    76    32    62   109    59    58
   32     4224       1   1.6% | GB/s:   150    63   123   218   113   111
   48     6336       1   2.3% | GB/s:   224    93   176   305   168   165
   64     8448       1   3.1% | GB/s:   298   124   236   409   219   215
   80    10560       1   3.9% | GB/s:   370   152   283   488   273   268
   96    12672       1   4.7% | GB/s:   442   182   343   589   321   315
  112    14784       1   5.5% | GB/s:   514   209   386   658   373   366
   64    16896       2   6.2% | GB/s:   584   247   463   795   429   421
  160    21120       1   7.8% | GB/s:   727   296   540   914   509   501
   96    25344       2   9.4% | GB/s:   874   369   663  1121   623   612
  128    33792       2  12.5% | GB/s:  1149   489   864  1455   803   789
  160    42240       2  15.6% | GB/s:  1421   591  1013  1703   962   949
  192    50688       2  18.8% | GB/s:  1694   706  1172  1974  1121  1102
  224    59136       2  21.9% | GB/s:  1949   810  1320  2226  1269  1248
  256    67584       2  25.0% | GB/s:  2204   942  1477  2474  1411  1387
  288    76032       2  28.1% | GB/s:  2410  1014  1594  2670  1540  1514
  320    84480       2  31.2% | GB/s:  2623  1121  1722  2857  1667  1639
  352    92928       2  34.4% | GB/s:  2824  1211  1842  3031  1786  1756
  384   101376       2  37.5% | GB/s:  3032  1318  1968  3197  1905  1874
  416   109824       2  40.6% | GB/s:  3191  1394  2070  3323  2008  1976
  448   118272       2  43.8% | GB/s:  3414  1484  2175  3446  2111  2077
  480   126720       2  46.9% | GB/s:  3555  1565  2275  3558  2210  2174
  512   135168       2  50.0% | GB/s:  3733  1695  2392  3674  2318  2275
  544   143616       2  53.1% | GB/s:  3891  1731  2464  3759  2396  2358
  576   152064       2  56.2% | GB/s:  4045  1820  2553  3843  2478  2438
  608   160512       2  59.4% | GB/s:  4150  1898  2642  3925  2576  2531
  640   168960       2  62.5% | GB/s:  4279  1993  2732  3993  2665  2621
  672   177408       2  65.6% | GB/s:  4475  2053  2812  4043  2742  2699
  704   185856       2  68.8% | GB/s:  4592  2132  2892  4091  2821  2776
  736   194304       2  71.9% | GB/s:  4631  2205  2969  4139  2897  2853
  768   202752       2  75.0% | GB/s:  4687  2297  3046  4204  2971  2927
  800   211200       2  78.1% | GB/s:  4703  2339  3111  4274  3039  2992
  832   219648       2  81.2% | GB/s:  4705  2404  3180  4339  3102  3061
  864   228096       2  84.4% | GB/s:  4705  2469  3246  4385  3167  3125
  896   236544       2  87.5% | GB/s:  4705  2543  3310  4420  3229  3189
  928   244992       2  90.6% | GB/s:  4705  2596  3366  4444  3287  3245
  960   253440       2  93.8% | GB/s:  4705  2660  3419  4461  3342  3301
  992   261888       2  96.9% | GB/s:  4704  2723  3469  4472  3390  3351
 1024   270336       2 100.0% | GB/s:  4698  2779  3505  4478  3403  3354

Jan 14 '25 08:01 yulingding

This is not caused by the read-write ratio, but by the amount of memory parallelism.

On the H200, the memory interface is so wide that even at full occupancy, with every thread waiting for data, there is still not enough outstanding memory traffic to saturate it. READ just reads 8 B per thread. SCALE transfers 16 B per thread and TRIAD 32 B, so these kernels have higher memory parallelism per thread. It is possible to implement the READ kernel so that each thread computes multiple array indices and consequently has higher memory parallelism, but I deliberately kept it that way because that is what applications do in practice.
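
A rough way to sanity-check this against the posted output is to compare rows in which all resident threads together request the same number of bytes. The sketch below is only an illustration: the per-thread byte counts (8, 16, 32) come from the explanation above, the thread counts and GB/s values are copied from the cuda-stream output in the first post, and the names `BYTES_PER_THREAD`, `ROWS`, and `bytes_requested` are made up for this example. It also assumes, as a simplification, that each resident thread has exactly one element's worth of traffic outstanding and that loads and stores count alike.

```python
# Per-thread traffic as stated in the explanation above (bytes).
BYTES_PER_THREAD = {"read": 8, "scale": 16, "triad": 32}

# (total threads, measured GB/s) copied from the posted cuda-stream output.
# Rows are picked so that threads * bytes_per_thread is the same for each kernel.
ROWS = {
    "read": (270336, 2779),   # 100.0% occupancy row, read column
    "scale": (135168, 2392),  # 50.0% occupancy row, scale column
    "triad": (67584, 2474),   # 25.0% occupancy row, triad column
}


def bytes_requested(kernel: str) -> int:
    """Bytes requested by all resident threads together (simplified model)."""
    threads, _ = ROWS[kernel]
    return threads * BYTES_PER_THREAD[kernel]


for kernel, (threads, gbps) in ROWS.items():
    print(f"{kernel:5s} threads={threads:6d} "
          f"bytes={bytes_requested(kernel):7d} measured={gbps} GB/s")
```

In all three rows the resident threads request 2,162,688 bytes, and the measured bandwidths fall between 2392 and 2779 GB/s. At full occupancy, where the three kernels differ in bytes per thread, the same columns spread from 2779 to 4478 GB/s. Under the stated simplifications, that is consistent with bandwidth following the amount of data in flight rather than the read-write ratio.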

STORE is fast because stores are not exposed to memory latency: a thread can issue a store and continue without waiting for it to complete.

Jan 20 '25 15:01 te42kyfo