nvbench icon indicating copy to clipboard operation
nvbench copied to clipboard

Write-only benchmark exceeds 100% bandwidth

Open bernhardmgruber opened this issue 5 months ago • 2 comments

@gevtushenko and I have run into this curious case: When benchmarking cub::DeviceTransform::Fill to just fill a buffer with values (see source), we sometimes get results like this:

## generate

### [0] NVIDIA GeForce RTX 5090

| T{ct} |   Elements{io}   | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  | GlobalMem BW | BWUtil  | Samples | Batch GPU  |
|-------|------------------|---------|------------|--------|------------|--------|----------|--------------|---------|---------|------------|
|    I8 |     2^16 = 65536 |    314x |   9.475 us | 13.30% |   2.836 us | 34.84% |  23.108G |  23.108 GB/s |   1.29% | 230528x |   2.169 us |
|    I8 |   2^20 = 1048576 |    414x |   9.200 us | 10.08% |   2.683 us | 34.71% | 390.860G | 390.860 GB/s |  21.81% | 235432x |   2.124 us |
|    I8 |  2^24 = 16777216 |    262x |  15.828 us |  4.94% |   8.182 us |  1.94% |   2.050T |   2.050 TB/s | 114.41% | 100812x |   4.961 us |
|    I8 | 2^28 = 268435456 |    370x | 166.190 us |  0.79% | 158.758 us |  0.73% |   1.691T |   1.691 TB/s |  94.35% |   3338x | 158.649 us |

Notice the 114.41% BWUtil for the third run. Trying to figure out what happens, I run the benchmark under ncu and the memory workload analysis shows: Image It looks like the kernel did not write back the data from L2 to GMEM. This may explain why the kernel could finish faster than it would take to do the full write back to GMEM (which is what I guess nvbench assumes).

ncu shows me that smaller runs write back zero bytes from L2 to GMEM, and the larger 2^28 run writes back everything except about 62MB of data, which remains in L2. The experiment was done on an RTX 5090 with 92MB L2.

Image

It seems nvbench should incorporate a mechanism that forces the benchmarked kernel to do the full write back to GMEM, so bandwidth measurements on writes are accurate.

bernhardmgruber avatar Sep 17 '25 15:09 bernhardmgruber

Perhaps, warm-up run should execute cudaCtxResetPersistingL2Cache() after the warm-up call? Ref: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g37ef93f921871331188f90fb2eb20e5e

For reproducibility sake, I assume you used CUDAToolkit 13.0 and driver 580 version?

oleksandr-pavlyk avatar Sep 17 '25 19:09 oleksandr-pavlyk

For reproducibility sake, I assume you used CUDAToolkit 13.0 and driver 580 version?

bgruber@concorde:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
bgruber@concorde:~$ nvidia-smi
Thu Sep 18 10:35:38 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |

bernhardmgruber avatar Sep 18 '25 08:09 bernhardmgruber