How can the output metrics be interpreted

Open seanajohnston opened this issue 3 years ago • 4 comments

I'm fairly new to the arena of profiling CUDA kernels and would like to learn more about the basic output metrics of this library. Specifically, looking at the output of `nvbench.example.throughput`:

| NumElements |  DataSize  | Samples |  CPU Time  | Noise |  GPU Time  | Noise | Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU  |
|-------------|------------|---------|------------|-------|------------|-------|---------|--------------|--------|---------|------------|
|    16777216 | 64.000 MiB |   1750x | 301.822 us | 5.83% | 285.873 us | 0.50% | 58.688G | 469.501 GB/s | 30.19% |   1809x | 279.783 us |

My questions per field are:

  1. NumElements: Is this the total number of elements processed over the total number of samples taken?
  2. DataSize: Same question. Is this the total data size processed over successive function calls?
  3. Samples: I assume this refers to the total number of times the benchmarked function is called. Is this correct? Also, why is it included twice, with a different value each time?
  4. CPU Time: Total time over successive calls or average?
  5. GPU Time: Same question. Total time or average?
  6. Noise: What is the definition of this metric?
  7. Elem/S: Self explanatory. The number of elements a kernel can process per sec I assume?
  8. GlobalMem BW: Does this measure the speed of device->host and host->device data movement?
  9. BW Util: The % being used of the maximum possible BW?
  10. Batch GPU: How is this different from GPU Time?

Thank you in advance. I just didn't see any documentation or output that explicitly defined these metrics or how to interpret.

seanajohnston avatar Dec 19 '22 19:12 seanajohnston

> I just didn't see any documentation or output that explicitly defined these metrics or how to interpret.

This is an area that we plan to improve before the 1.0 release. I appreciate the questions; they help to spot items that need better docs. To answer your questions:

1, 2) These are per sample, not the sum of all samples.

3) NVBench takes two sets of measurements: Isolated ("Cold") and Batch ("Hot"). Isolated benchmarks flush the L2 cache and measure each sample independently, while batch benchmarks repeatedly launch the same kernel multiple times with a single timer and a hot cache. See #11 for a more detailed description. The first "Samples" field is for the isolated measurements, and the second "Samples" field is for the batch measurements.

4, 5) These are averages.

6) Noise is the relative standard deviation, expressed as a percentage of the average.

7) Correct.

8) This is the on-chip bandwidth for all accesses to the device's global memory, not host/device transfers.

9) Yes, this is the percentage of the theoretical peak bandwidth of the device.

10) See #11.
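To make the noise definition above concrete, here is a minimal sketch of a relative standard deviation. The helper name `noise_percent` and the sample values are hypothetical, and the choice of the sample (n - 1) estimator is an assumption; this thread does not say which estimator NVBench uses internally.

```python
import statistics

def noise_percent(sample_times):
    """Relative standard deviation of the samples, as a percentage of the mean."""
    mean = statistics.fmean(sample_times)
    # Assumption: sample (n - 1) standard deviation; the exact estimator
    # NVBench uses is not stated in this thread.
    return 100.0 * statistics.stdev(sample_times) / mean

# Hypothetical per-sample GPU times in microseconds (not taken from the thread).
times_us = [100.0, 102.0, 98.0, 100.0]
print(f"{noise_percent(times_us):.2f}%")
```

A low value means the individual samples cluster tightly around the reported average; a high value means the average is less trustworthy.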

I'll leave this open until we have better docs for this stuff -- let us know if there's anything else that we can make clearer.

alliepiper avatar Jan 30 '23 16:01 alliepiper
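The Elem/s, GlobalMem BW, and BWUtil answers can be cross-checked against the row in the original question. The sketch below assumes the kernel reads 64 MiB and writes 64 MiB per sample (the table only shows a single 64 MiB DataSize, so the factor of two is an assumption that happens to reproduce the reported bandwidth). The peak bandwidth is not given for that device, so it is only inferred from the reported utilization.

```python
MIB = 1024 ** 2

# Values copied from the table in the original question.
num_elements = 16_777_216
gpu_time_s = 285.873e-6          # mean GPU time per sample
reported_bw_util = 0.3019        # BWUtil column (30.19%)

# Assumption: one 64 MiB read plus one 64 MiB write per sample.
bytes_moved = 2 * 64 * MIB

elem_per_s = num_elements / gpu_time_s        # Elem/s
global_mem_bw = bytes_moved / gpu_time_s      # GlobalMem BW in bytes/s

# Inferred, not reported: the peak bandwidth that would give 30.19% utilization.
implied_peak_bw = global_mem_bw / reported_bw_util

print(f"Elem/s:       {elem_per_s / 1e9:.3f}G")
print(f"GlobalMem BW: {global_mem_bw / 1e9:.3f} GB/s")
print(f"Implied peak: {implied_peak_bw / 1e9:.0f} GB/s")
```

Both throughput columns are derived from the per-sample mean GPU time, which is consistent with NumElements and DataSize being per-sample quantities.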

Thank you for explaining these concepts. I am still not clear on the concept of noise. Can you explain it in more detail? Also, are all of these metrics output for every test? Can I choose which metrics are output? If so, where can I choose?

Joker1213 avatar Mar 02 '23 05:03 Joker1213

Devices

[0] NVIDIA GeForce RTX 3090

  • SM Version: 860 (PTX Version: 520)
  • Number of SMs: 82
  • SM Default Clock Rate: 1695 MHz
  • Global Memory: 23447 MiB Free / 24259 MiB Total
  • Global Memory Bus Peak: 936 GB/sec (384-bit DDR @9751MHz)
  • Max Shared Memory: 100 KiB/SM, 48 KiB/Block
  • L2 Cache Size: 6144 KiB
  • Maximum Active Blocks: 16/SM
  • Maximum Active Threads: 1536/SM, 1024/Block
  • Available Registers: 65536/SM, 65536/Block
  • ECC Enabled: No

Log

Run:  [1/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^3]
Pass: Cold: 0.073730ms GPU, 0.079243ms CPU, 0.50s total GPU, 6784x
Pass: Batch: 0.072141ms GPU, 0.54s total GPU, 7448x
Run:  [2/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^3]
Pass: Cold: 0.071241ms GPU, 0.076573ms CPU, 0.50s total GPU, 7024x
Pass: Batch: 0.073041ms GPU, 0.54s total GPU, 7342x
Run:  [3/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^3]
Pass: Cold: 0.071441ms GPU, 0.076742ms CPU, 0.50s total GPU, 7008x
Pass: Batch: 0.072574ms GPU, 0.53s total GPU, 7342x
Run:  [4/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^4]
Pass: Cold: 0.071292ms GPU, 0.076717ms CPU, 0.50s total GPU, 7024x
Pass: Batch: 0.072263ms GPU, 0.53s total GPU, 7342x
Run:  [5/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^4]
Pass: Cold: 0.072032ms GPU, 0.077577ms CPU, 0.50s total GPU, 6944x
Pass: Batch: 0.072866ms GPU, 0.53s total GPU, 7342x
Run:  [6/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^4]
Pass: Cold: 0.071790ms GPU, 0.077333ms CPU, 0.50s total GPU, 6976x
Pass: Batch: 0.072963ms GPU, 0.53s total GPU, 7248x
Run:  [7/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^5]
Pass: Cold: 0.072318ms GPU, 0.077639ms CPU, 0.50s total GPU, 6928x
Pass: Batch: 0.073379ms GPU, 0.53s total GPU, 7239x
Run:  [8/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^5]
Pass: Cold: 0.071990ms GPU, 0.077392ms CPU, 0.50s total GPU, 6960x
Pass: Batch: 0.073006ms GPU, 0.54s total GPU, 7342x
Run:  [9/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^5]
Pass: Cold: 0.088379ms GPU, 0.093626ms CPU, 0.50s total GPU, 5664x
Pass: Batch: 0.089679ms GPU, 0.53s total GPU, 5907x

Benchmark Results

resize_benchmark

[0] NVIDIA GeForce RTX 3090

| blockDimx | blockDimy | NumElements | inDataSize | outDataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWPeak | Batch GPU | Batch |
|-----------|-----------|-------------|------------|-------------|---------|----------|-------|----------|-------|--------|--------------|--------|-----------|-------|
| 2^3 = 8 | 2^3 = 8 | 16070400 | 15.326 MiB | 15.326 MiB | 6784x | 79.243 us | 729073.04% | 73.730 us | 6.35% | 217.963G | 435.926 GB/s | 46.57% | 72.141 us | 7448x |
| 2^4 = 16 | 2^3 = 8 | 16070400 | 15.326 MiB | 15.326 MiB | 7024x | 76.573 us | 754913.20% | 71.241 us | 1.16% | 225.577G | 451.153 GB/s | 48.20% | 73.041 us | 7342x |
| 2^5 = 32 | 2^3 = 8 | 16070400 | 15.326 MiB | 15.326 MiB | 7008x | 76.742 us | 752747.91% | 71.441 us | 0.85% | 224.948G | 449.896 GB/s | 48.06% | 72.574 us | 7342x |
| 2^3 = 8 | 2^4 = 16 | 16070400 | 15.326 MiB | 15.326 MiB | 7024x | 76.717 us | 755803.32% | 71.292 us | 0.99% | 225.418G | 450.836 GB/s | 48.16% | 72.263 us | 7342x |
| 2^4 = 16 | 2^4 = 16 | 16070400 | 15.326 MiB | 15.326 MiB | 6944x | 77.577 us | 747796.34% | 72.032 us | 0.84% | 223.099G | 446.199 GB/s | 47.67% | 72.866 us | 7342x |
| 2^5 = 32 | 2^4 = 16 | 16070400 | 15.326 MiB | 15.326 MiB | 6976x | 77.333 us | 751404.85% | 71.790 us | 0.75% | 223.852G | 447.704 GB/s | 47.83% | 72.963 us | 7248x |
| 2^3 = 8 | 2^5 = 32 | 16070400 | 15.326 MiB | 15.326 MiB | 6928x | 77.639 us | 743720.83% | 72.318 us | 1.32% | 222.220G | 444.440 GB/s | 47.48% | 73.379 us | 7239x |
| 2^4 = 16 | 2^5 = 32 | 16070400 | 15.326 MiB | 15.326 MiB | 6960x | 77.392 us | 748172.71% | 71.990 us | 0.95% | 223.230G | 446.460 GB/s | 47.69% | 73.006 us | 7342x |
| 2^5 = 32 | 2^5 = 32 | 16070400 | 15.326 MiB | 15.326 MiB | 5664x | 93.626 us | 599971.62% | 88.379 us | 0.91% | 181.834G | 363.668 GB/s | 38.85% | 89.679 us | 5907x |

Hi, I tried to benchmark my kernel; the output is above. My questions are:

  1. Does the 'Batch' field have a similar meaning to the 'Samples' field in the isolated measurement? And why do the sample counts vary across blockDimx and blockDimy configurations? Are there any rules for these values?
  2. For the isolated measurement, are the values in the Noise field in a valid range? Thank you sincerely.

thishome avatar Dec 06 '23 07:12 thishome

  1. Yes -- the Batch column is the number of kernel executions used for the batch measurements. The number of samples is determined dynamically based on a variety of criteria like noise, etc, so they will vary based on the characteristics of the kernel.

  2. This depends on your use case. Most look fairly reasonable to me, though the 6% is pretty high IMO. Making sure that the GPU clocks are locked is the best way to make sure that noise values are minimized.

alliepiper avatar Dec 12 '23 18:12 alliepiper
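One pattern in the posted log fits the point about sample counts varying with the kernel: for each cold run, the sample count multiplied by the mean GPU time lands close to the "0.50s total GPU" figure, so slower configurations end up with fewer samples. The sketch below only checks that arithmetic on three log entries; it is an observation about this log, not a description of NVBench's actual stopping criteria, which also involve noise and other factors.

```python
# (samples, mean GPU time in ms) copied from the "Pass: Cold" log lines.
cold_runs = {
    "blockDimx=2^3 blockDimy=2^3": (6784, 0.073730),
    "blockDimx=2^4 blockDimy=2^3": (7024, 0.071241),
    "blockDimx=2^5 blockDimy=2^5": (5664, 0.088379),
}

def total_gpu_seconds(samples, mean_ms):
    """Accumulated GPU time implied by the sample count and the mean per sample."""
    return samples * mean_ms / 1000.0

totals = {name: total_gpu_seconds(*run) for name, run in cold_runs.items()}

for name, total in totals.items():
    print(f"{name}: {total:.2f} s total GPU")
```

The slowest configuration (2^5 x 2^5, 88.379 us per sample) accumulates the same total time with 5664 samples that the faster ones reach with roughly 7000.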