Read bandwidth does not match
GPU H200
rootgpu-stream# ./cuda-stream
block  smBlocks  threads    occ%  |           init   read  scale  triad    3pt    5pt
   16      2112        1    0.8%  |  GB/s:     76     32     62    109     59     58
   32      4224        1    1.6%  |  GB/s:    150     63    123    218    113    111
   48      6336        1    2.3%  |  GB/s:    224     93    176    305    168    165
   64      8448        1    3.1%  |  GB/s:    298    124    236    409    219    215
   80     10560        1    3.9%  |  GB/s:    370    152    283    488    273    268
   96     12672        1    4.7%  |  GB/s:    442    182    343    589    321    315
  112     14784        1    5.5%  |  GB/s:    514    209    386    658    373    366
   64     16896        2    6.2%  |  GB/s:    584    247    463    795    429    421
  160     21120        1    7.8%  |  GB/s:    727    296    540    914    509    501
   96     25344        2    9.4%  |  GB/s:    874    369    663   1121    623    612
  128     33792        2   12.5%  |  GB/s:   1149    489    864   1455    803    789
  160     42240        2   15.6%  |  GB/s:   1421    591   1013   1703    962    949
  192     50688        2   18.8%  |  GB/s:   1694    706   1172   1974   1121   1102
  224     59136        2   21.9%  |  GB/s:   1949    810   1320   2226   1269   1248
  256     67584        2   25.0%  |  GB/s:   2204    942   1477   2474   1411   1387
  288     76032        2   28.1%  |  GB/s:   2410   1014   1594   2670   1540   1514
  320     84480        2   31.2%  |  GB/s:   2623   1121   1722   2857   1667   1639
  352     92928        2   34.4%  |  GB/s:   2824   1211   1842   3031   1786   1756
  384    101376        2   37.5%  |  GB/s:   3032   1318   1968   3197   1905   1874
  416    109824        2   40.6%  |  GB/s:   3191   1394   2070   3323   2008   1976
  448    118272        2   43.8%  |  GB/s:   3414   1484   2175   3446   2111   2077
  480    126720        2   46.9%  |  GB/s:   3555   1565   2275   3558   2210   2174
  512    135168        2   50.0%  |  GB/s:   3733   1695   2392   3674   2318   2275
  544    143616        2   53.1%  |  GB/s:   3891   1731   2464   3759   2396   2358
  576    152064        2   56.2%  |  GB/s:   4045   1820   2553   3843   2478   2438
  608    160512        2   59.4%  |  GB/s:   4150   1898   2642   3925   2576   2531
  640    168960        2   62.5%  |  GB/s:   4279   1993   2732   3993   2665   2621
  672    177408        2   65.6%  |  GB/s:   4475   2053   2812   4043   2742   2699
  704    185856        2   68.8%  |  GB/s:   4592   2132   2892   4091   2821   2776
  736    194304        2   71.9%  |  GB/s:   4631   2205   2969   4139   2897   2853
  768    202752        2   75.0%  |  GB/s:   4687   2297   3046   4204   2971   2927
  800    211200        2   78.1%  |  GB/s:   4703   2339   3111   4274   3039   2992
  832    219648        2   81.2%  |  GB/s:   4705   2404   3180   4339   3102   3061
  864    228096        2   84.4%  |  GB/s:   4705   2469   3246   4385   3167   3125
  896    236544        2   87.5%  |  GB/s:   4705   2543   3310   4420   3229   3189
  928    244992        2   90.6%  |  GB/s:   4705   2596   3366   4444   3287   3245
  960    253440        2   93.8%  |  GB/s:   4705   2660   3419   4461   3342   3301
  992    261888        2   96.9%  |  GB/s:   4704   2723   3469   4472   3390   3351
 1024    270336        2  100.0%  |  GB/s:   4698   2779   3505   4478   3403   3354
The lower READ bandwidth is not due to the read-write ratio but to the amount of memory parallelism.
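The link between memory parallelism and bandwidth can be written down with Little's law, which is a general queueing result and not something specific to this benchmark. In the sketch below, B is the achieved bandwidth, B_peak the limit of the memory interface, N the number of resident threads, b the bytes each thread keeps in flight at once, and T the average memory latency. All of these symbols are introduced here for illustration only.

```latex
B \;\approx\; \min\!\left( B_{\text{peak}},\; \frac{N \cdot b}{T} \right)
```

At a fixed thread count and latency, bandwidth can only rise if each thread keeps more bytes in flight, until the interface itself becomes the limit.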
On the H200, the memory interface is so wide that even at full occupancy, with every thread waiting for data, there are not enough loads in flight to saturate it. READ loads just 8 B per thread. SCALE transfers 16 B per thread and TRIAD 32 B, so these kernels have higher memory parallelism per thread. It is possible to implement the READ kernel so that each thread handles multiple array indices and consequently has higher memory parallelism, but I deliberately kept it at one element per thread because that is what applications do in practice.
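As a rough illustration of the alternative mentioned above, here is a minimal sketch of a read kernel with one load per thread next to a variant with several independent loads per thread. This is not the benchmark's actual source: the names `read_1`, `read_n`, `grid_size`, and `run_read_demo`, the choice of 4 loads, and the buffer size are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One 8-byte load per thread, matching the READ description above.
__global__ void read_1(const double* __restrict__ a, double* sink, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double v = a[i];
    if (v == 123.456) sink[0] = v;  // never true for zeroed data; keeps the load alive
}

// Hypothetical variant: each thread issues LOADS independent loads before
// using any of them, so more requests are in flight per thread.
template <int LOADS>
__global__ void read_n(const double* __restrict__ a, double* sink, size_t n) {
    size_t base = (size_t)blockIdx.x * blockDim.x * LOADS + threadIdx.x;
    double v[LOADS];
#pragma unroll
    for (int k = 0; k < LOADS; ++k) {
        size_t i = base + (size_t)k * blockDim.x;  // block-wide stride keeps each load coalesced
        v[k] = (i < n) ? a[i] : 0.0;
    }
    double sum = 0.0;
#pragma unroll
    for (int k = 0; k < LOADS; ++k) sum += v[k];
    if (sum == 123.456) sink[0] = sum;  // never true for zeroed data; keeps the loads alive
}

// Blocks needed so that blocks * threads * loads covers n elements.
inline unsigned grid_size(size_t n, unsigned threads, unsigned loads) {
    size_t per_block = (size_t)threads * loads;
    return (unsigned)((n + per_block - 1) / per_block);
}

// Times both kernels once over a zero-filled buffer and prints GB/s.
// Returns 0 on success, 1 on any CUDA error. Sizes are assumptions.
int run_read_demo(size_t n = size_t(1) << 28, unsigned threads = 1024) {
    double *a = nullptr, *sink = nullptr;
    if (cudaMalloc(&a, n * sizeof(double)) != cudaSuccess) return 1;
    if (cudaMalloc(&sink, sizeof(double)) != cudaSuccess) { cudaFree(a); return 1; }
    cudaMemset(a, 0, n * sizeof(double));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms1 = 0.0f, ms4 = 0.0f;

    // Warm-up launches so the timed runs exclude one-time startup cost.
    read_1<<<grid_size(n, threads, 1), threads>>>(a, sink, n);
    read_n<4><<<grid_size(n, threads, 4), threads>>>(a, sink, n);

    cudaEventRecord(t0);
    read_1<<<grid_size(n, threads, 1), threads>>>(a, sink, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms1, t0, t1);

    cudaEventRecord(t0);
    read_n<4><<<grid_size(n, threads, 4), threads>>>(a, sink, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms4, t0, t1);

    bool ok = (cudaGetLastError() == cudaSuccess);
    double bytes = (double)n * sizeof(double);
    if (ok)
        printf("1 load/thread: %.0f GB/s, 4 loads/thread: %.0f GB/s\n",
               bytes / (ms1 * 1e6), bytes / (ms4 * 1e6));

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(a);
    cudaFree(sink);
    return ok ? 0 : 1;
}
```

Calling `run_read_demo()` from a `main` and building with `nvcc` should print both figures. Whether the multi-load variant actually closes the gap to the other kernels on a given GPU is something to measure, not something this sketch guarantees.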
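The store-only kernel (the init column) is fast because stores do not expose memory latency: a thread issues the store and moves on without waiting for it to complete.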