How can the output metrics be interpreted?
I'm fairly new to profiling CUDA kernels and would like to learn more about the basic output metrics of this library. Specifically, looking at the output of nvbench.example.throughput:
| NumElements | DataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples | Batch GPU |
|-------------|------------|---------|------------|-------|------------|-------|---------|--------------|--------|---------|------------|
| 16777216 | 64.000 MiB | 1750x | 301.822 us | 5.83% | 285.873 us | 0.50% | 58.688G | 469.501 GB/s | 30.19% | 1809x | 279.783 us |
My questions per field are:
- NumElements: Is this the total number of elements processed over the total number of samples taken?
- DataSize: Same question. Is this the total data size processed over successive function calls?
- Samples: I assume this refers to the total number of times the benchmarked function is called. Is this correct? Also, why is it included twice, with a different value each time?
- CPU Time: Total time over successive calls or average?
- GPU Time: Same question. Total time or average?
- Noise: What is the definition of this metric?
- Elem/s: Self-explanatory. The number of elements a kernel can process per second, I assume?
- GlobalMem BW: Does this measure the speed of device->host and host->device data movement?
- BWUtil: The percentage of the maximum possible bandwidth that is being used?
- Batch GPU: How is this different from GPU Time?
Thank you in advance. I just didn't see any documentation or output that explicitly defined these metrics or how to interpret them.
> I just didn't see any documentation or output that explicitly defined these metrics or how to interpret.
This is an area that we plan to improve before the 1.0 release. I appreciate the questions; they help us spot items that need better docs. To answer your questions:
1,2) These are per sample, not the sum of all samples.
3) NVBench takes two sets of measurements: Isolated ("Cold") and Batch ("Hot"). Isolated benchmarks flush the L2 cache and measure each sample independently, while batch benchmarks repeatedly launch the same kernel multiple times with a single timer and a hot cache. See #11 for a more detailed description. The first "Samples" field is for the isolated measurements, and the second "Samples" field is for the batch measurements.
4,5) These are averages.
6) Noise is the relative standard deviation, expressed as a percentage of the average.
7) Correct.
8) This is the on-chip bandwidth for all accesses to the device's global memory, not host/device transfers.
9) Yes, this is the percentage of the theoretical peak bandwidth of the device.
10) See #11.
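As a rough cross-check of answers 7 to 9, the derived columns in the example row can be recomputed from the per-sample values. This is only a sketch of the arithmetic, not NVBench's code. It assumes the kernel reads `DataSize` bytes and writes the same amount per sample; the variable names are made up for illustration, and the peak bandwidth is back-calculated from the reported `BWUtil` because the table does not state it.

```python
# Values copied from the nvbench.example.throughput row quoted above.
num_elements = 16_777_216        # NumElements (per sample)
data_size_bytes = 64 * 1024**2   # DataSize: 64.000 MiB (per sample)
gpu_time_s = 285.873e-6          # mean isolated GPU Time, in seconds

# Assumption: the kernel reads DataSize bytes and writes DataSize bytes,
# so the global-memory traffic per sample is 2 * DataSize.
bytes_moved = 2 * data_size_bytes

elem_per_s = num_elements / gpu_time_s      # Elem/s column
global_mem_bw = bytes_moved / gpu_time_s    # GlobalMem BW column, in bytes/s

# The device's peak bandwidth is not in the table; this is the value
# implied by the reported BWUtil of 30.19%.
implied_peak_bw = global_mem_bw / 0.3019

print(f"Elem/s       = {elem_per_s / 1e9:.3f}G")
print(f"GlobalMem BW = {global_mem_bw / 1e9:.3f} GB/s")
print(f"Implied peak = {implied_peak_bw / 1e9:.0f} GB/s")
```

Under that read-plus-write assumption the first two results agree with the `Elem/s` and `GlobalMem BW` columns of the row, which is consistent with both being computed from the average GPU time of a single sample.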
I'll leave this open until we have better docs for this stuff -- let us know if there's anything else that we can make clearer.
Thank you for explaining these concepts. I am still not very clear about the concept of noise. Can you explain it in more detail? Also, are all of these indicators output for every test? Can I choose which indicators to output? If so, where?
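To make the definition of noise given earlier in the thread concrete (relative standard deviation, expressed as a percentage of the average), here is a minimal sketch. The sample times are invented for illustration, `noise_percent` is a hypothetical helper rather than an NVBench function, and the choice of the sample standard deviation (n - 1 denominator) is an assumption; the thread does not say which form NVBench uses.

```python
import statistics


def noise_percent(times):
    """Relative standard deviation of the samples, as a percentage of their mean."""
    mean = statistics.mean(times)
    # Assumption: sample standard deviation (n - 1 denominator).
    return 100.0 * statistics.stdev(times) / mean


# Hypothetical per-sample GPU times in microseconds (not measured data).
sample_times_us = [285.0, 286.0, 284.5, 287.0, 285.5]

print(f"mean  = {statistics.mean(sample_times_us):.3f} us")
print(f"noise = {noise_percent(sample_times_us):.2f}%")
```

A low value means the individual samples cluster tightly around the reported average; a high value means the average hides a wide spread between samples.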
Devices
[0] NVIDIA GeForce RTX 3090
- SM Version: 860 (PTX Version: 520)
- Number of SMs: 82
- SM Default Clock Rate: 1695 MHz
- Global Memory: 23447 MiB Free / 24259 MiB Total
- Global Memory Bus Peak: 936 GB/sec (384-bit DDR @9751MHz)
- Max Shared Memory: 100 KiB/SM, 48 KiB/Block
- L2 Cache Size: 6144 KiB
- Maximum Active Blocks: 16/SM
- Maximum Active Threads: 1536/SM, 1024/Block
- Available Registers: 65536/SM, 65536/Block
- ECC Enabled: No
Log
Run: [1/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^3]
Pass: Cold: 0.073730ms GPU, 0.079243ms CPU, 0.50s total GPU, 6784x
Pass: Batch: 0.072141ms GPU, 0.54s total GPU, 7448x
Run: [2/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^3]
Pass: Cold: 0.071241ms GPU, 0.076573ms CPU, 0.50s total GPU, 7024x
Pass: Batch: 0.073041ms GPU, 0.54s total GPU, 7342x
Run: [3/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^3]
Pass: Cold: 0.071441ms GPU, 0.076742ms CPU, 0.50s total GPU, 7008x
Pass: Batch: 0.072574ms GPU, 0.53s total GPU, 7342x
Run: [4/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^4]
Pass: Cold: 0.071292ms GPU, 0.076717ms CPU, 0.50s total GPU, 7024x
Pass: Batch: 0.072263ms GPU, 0.53s total GPU, 7342x
Run: [5/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^4]
Pass: Cold: 0.072032ms GPU, 0.077577ms CPU, 0.50s total GPU, 6944x
Pass: Batch: 0.072866ms GPU, 0.53s total GPU, 7342x
Run: [6/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^4]
Pass: Cold: 0.071790ms GPU, 0.077333ms CPU, 0.50s total GPU, 6976x
Pass: Batch: 0.072963ms GPU, 0.53s total GPU, 7248x
Run: [7/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^5]
Pass: Cold: 0.072318ms GPU, 0.077639ms CPU, 0.50s total GPU, 6928x
Pass: Batch: 0.073379ms GPU, 0.53s total GPU, 7239x
Run: [8/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^5]
Pass: Cold: 0.071990ms GPU, 0.077392ms CPU, 0.50s total GPU, 6960x
Pass: Batch: 0.073006ms GPU, 0.54s total GPU, 7342x
Run: [9/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^5]
Pass: Cold: 0.088379ms GPU, 0.093626ms CPU, 0.50s total GPU, 5664x
Pass: Batch: 0.089679ms GPU, 0.53s total GPU, 5907x
Benchmark Results
resize_benchmark
[0] NVIDIA GeForce RTX 3090
| blockDimx | blockDimy | NumElements | inDataSize | outDataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWPeak | Batch GPU | Batch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2^3 = 8 | 2^3 = 8 | 16070400 | 15.326 MiB | 15.326 MiB | 6784x | 79.243 us | 729073.04% | 73.730 us | 6.35% | 217.963G | 435.926 GB/s | 46.57% | 72.141 us | 7448x |
| 2^4 = 16 | 2^3 = 8 | 16070400 | 15.326 MiB | 15.326 MiB | 7024x | 76.573 us | 754913.20% | 71.241 us | 1.16% | 225.577G | 451.153 GB/s | 48.20% | 73.041 us | 7342x |
| 2^5 = 32 | 2^3 = 8 | 16070400 | 15.326 MiB | 15.326 MiB | 7008x | 76.742 us | 752747.91% | 71.441 us | 0.85% | 224.948G | 449.896 GB/s | 48.06% | 72.574 us | 7342x |
| 2^3 = 8 | 2^4 = 16 | 16070400 | 15.326 MiB | 15.326 MiB | 7024x | 76.717 us | 755803.32% | 71.292 us | 0.99% | 225.418G | 450.836 GB/s | 48.16% | 72.263 us | 7342x |
| 2^4 = 16 | 2^4 = 16 | 16070400 | 15.326 MiB | 15.326 MiB | 6944x | 77.577 us | 747796.34% | 72.032 us | 0.84% | 223.099G | 446.199 GB/s | 47.67% | 72.866 us | 7342x |
| 2^5 = 32 | 2^4 = 16 | 16070400 | 15.326 MiB | 15.326 MiB | 6976x | 77.333 us | 751404.85% | 71.790 us | 0.75% | 223.852G | 447.704 GB/s | 47.83% | 72.963 us | 7248x |
| 2^3 = 8 | 2^5 = 32 | 16070400 | 15.326 MiB | 15.326 MiB | 6928x | 77.639 us | 743720.83% | 72.318 us | 1.32% | 222.220G | 444.440 GB/s | 47.48% | 73.379 us | 7239x |
| 2^4 = 16 | 2^5 = 32 | 16070400 | 15.326 MiB | 15.326 MiB | 6960x | 77.392 us | 748172.71% | 71.990 us | 0.95% | 223.230G | 446.460 GB/s | 47.69% | 73.006 us | 7342x |
| 2^5 = 32 | 2^5 = 32 | 16070400 | 15.326 MiB | 15.326 MiB | 5664x | 93.626 us | 599971.62% | 88.379 us | 0.91% | 181.834G | 363.668 GB/s | 38.85% | 89.679 us | 5907x |
Hi, I am trying to benchmark my kernel, and the output is shown above. My questions are:
- Does the 'Batch' field have a similar meaning to the 'Samples' field in the isolated measurement? And why do the sample counts vary across the blockDimx and blockDimy configurations? Are there any rules for these values?
- For the isolated measurement, are the values in the Noise field within a valid range? Thank you sincerely.
- Yes, the `Batch` column is the number of kernel executions used for the batch measurements. The number of samples is determined dynamically based on a variety of criteria such as noise, so it will vary with the characteristics of the kernel.
- This depends on your use case. Most look fairly reasonable to me, though the 6% is pretty high IMO. Locking the GPU clocks is the best way to minimize the noise values.
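One pattern in the log above helps explain why the sample counts differ between configurations: multiplying each cold sample count by its mean GPU time gives roughly the same total. The sketch below only reproduces that arithmetic from the `Pass: Cold` lines; it is not NVBench's stopping logic, and the exact criteria are not spelled out in this thread.

```python
# (samples, mean cold GPU time in microseconds), copied from the "Pass: Cold" log lines.
cold_runs = [
    (6784, 73.730),
    (7024, 71.241),
    (7008, 71.441),
    (7024, 71.292),
    (6944, 72.032),
    (6976, 71.790),
    (6928, 72.318),
    (6960, 71.990),
    (5664, 88.379),
]

# Total GPU time per run, in seconds: sample count times mean time per sample.
totals_s = [samples * mean_us * 1e-6 for samples, mean_us in cold_runs]

for (samples, mean_us), total_s in zip(cold_runs, totals_s):
    print(f"{samples:5d} samples x {mean_us:.3f} us = {total_s:.3f} s")
```

Every product lands close to the `0.50s total GPU` reported in the log, so the slower `2^5 x 2^5` configuration collects fewer samples simply because each sample takes longer. That is consistent with, but does not prove, sampling continuing until a roughly fixed amount of GPU time has accumulated.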