I'm fairly new to the arena of profiling CUDA kernels and would like to learn more about the basic output metrics of this library. Specifically, looking at the output of nvbench.example.throughput:

| NumElements |  DataSize  | Samples |  CPU Time  | Noise |  GPU Time  | Noise | Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU  |
|-------------|------------|---------|------------|-------|------------|-------|---------|--------------|--------|---------|------------|
|    16777216 | 64.000 MiB |   1750x | 301.822 us | 5.83% | 285.873 us | 0.50% | 58.688G | 469.501 GB/s | 30.19% |   1809x | 279.783 us |

My questions per field are:

NumElements: Is this the total number of elements processed over the total number of samples taken?
DataSize: Same question. Is this the total data size processed over successive function calls?
Samples: I assume this refers to the total number of times the benchmarked function is called. Is this correct? Also, why is it included twice, with a different value each time?
CPU Time: Total time over successive calls or average?
GPU Time: Same question. Total time or average?
Noise: What is the definition of this metric?
Elem/S: Self explanatory. The number of elements a kernel can process per sec I assume?
GlobalMem BW: Does this measure the speed of device->host and host->device data movement?
BW Util: The % being used of the maximum possible BW?
Batch GPU: How is this different from GPU Time?

Thank you in advance. I just didn't see any documentation or output that explicitly defined these metrics or explained how to interpret them.

Dec 19 '22 19:12 seanajohnston

> I just didn't see any documentation or output that explicitly defined these metrics or how to interpret.

This is an area that we plan to improve before the 1.0 release. I appreciate the questions, it helps to spot items that need better docs. To answer your questions:

1,2) NumElements and DataSize are per sample, not the sum of all samples.

3) NVBench takes two sets of measurements: Isolated ("Cold") and Batch ("Hot"). Isolated benchmarks flush the L2 cache and measure each sample independently, while batch benchmarks launch the same kernel repeatedly under a single timer with a hot cache. See #11 for a more detailed description. The first "Samples" field is for the isolated measurements, and the second "Samples" field is for the batch measurements.

4,5) CPU Time and GPU Time are averages.

6) Noise is the relative standard deviation, expressed as a percentage of the average.

7) Correct.

8) This is the on-chip bandwidth for all accesses to the device's global memory, not host/device transfers.

9) Yes, this is the percentage of the theoretical peak bandwidth of the device.

10) See #11.
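As a rough illustration of the noise definition and the throughput columns, here is a minimal Python sketch. The per-sample timings are made up for illustration, and whether NVBench uses the population or the sample standard deviation is an assumption here. The byte count (64 MiB read plus 64 MiB written) is also an assumption, chosen because it reproduces the GlobalMem BW value in the table from the original question.

```python
import statistics

# Hypothetical per-sample GPU times in microseconds (not taken from the thread).
gpu_times_us = [285.1, 286.4, 284.9, 287.2, 285.8]

mean_us = statistics.mean(gpu_times_us)
# Noise = relative standard deviation, as a percentage of the mean.
# Assumption: population standard deviation; NVBench may differ in detail.
noise_pct = 100.0 * statistics.pstdev(gpu_times_us) / mean_us

# Values copied from the nvbench.example.throughput row in the question.
num_elements = 16_777_216
gpu_time_s = 285.873e-6            # mean GPU time of one sample
# Assumption: each sample reads 64 MiB and writes 64 MiB of global memory.
bytes_moved = 2 * 64 * 1024**2

elem_per_s = num_elements / gpu_time_s      # compare with the Elem/s column
global_mem_bw = bytes_moved / gpu_time_s    # compare with the GlobalMem BW column

print(f"noise        = {noise_pct:.2f}%")
print(f"Elem/s       = {elem_per_s / 1e9:.3f}G")
print(f"GlobalMem BW = {global_mem_bw / 1e9:.3f} GB/s")
```

BWUtil is then the GlobalMem BW divided by the device's theoretical peak bandwidth, expressed as a percentage.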

I'll leave this open until we have better docs for this stuff -- let us know if there's anything else that we can make clearer.

Jan 30 '23 16:01 alliepiper

Thank you for explaining these concepts. I am still not clear on the concept of noise. Can you explain it in more detail? Also, are all of these metrics output for every test? Can I choose which metrics to output? If so, where?

Mar 02 '23 05:03 Joker1213

Devices

[0] `NVIDIA GeForce RTX 3090`

SM Version: 860 (PTX Version: 520)
Number of SMs: 82
SM Default Clock Rate: 1695 MHz
Global Memory: 23447 MiB Free / 24259 MiB Total
Global Memory Bus Peak: 936 GB/sec (384-bit DDR @9751MHz)
Max Shared Memory: 100 KiB/SM, 48 KiB/Block
L2 Cache Size: 6144 KiB
Maximum Active Blocks: 16/SM
Maximum Active Threads: 1536/SM, 1024/Block
Available Registers: 65536/SM, 65536/Block
ECC Enabled: No

Log

Run:  [1/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^3]
Pass: Cold: 0.073730ms GPU, 0.079243ms CPU, 0.50s total GPU, 6784x
Pass: Batch: 0.072141ms GPU, 0.54s total GPU, 7448x
Run:  [2/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^3]
Pass: Cold: 0.071241ms GPU, 0.076573ms CPU, 0.50s total GPU, 7024x
Pass: Batch: 0.073041ms GPU, 0.54s total GPU, 7342x
Run:  [3/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^3]
Pass: Cold: 0.071441ms GPU, 0.076742ms CPU, 0.50s total GPU, 7008x
Pass: Batch: 0.072574ms GPU, 0.53s total GPU, 7342x
Run:  [4/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^4]
Pass: Cold: 0.071292ms GPU, 0.076717ms CPU, 0.50s total GPU, 7024x
Pass: Batch: 0.072263ms GPU, 0.53s total GPU, 7342x
Run:  [5/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^4]
Pass: Cold: 0.072032ms GPU, 0.077577ms CPU, 0.50s total GPU, 6944x
Pass: Batch: 0.072866ms GPU, 0.53s total GPU, 7342x
Run:  [6/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^4]
Pass: Cold: 0.071790ms GPU, 0.077333ms CPU, 0.50s total GPU, 6976x
Pass: Batch: 0.072963ms GPU, 0.53s total GPU, 7248x
Run:  [7/9] resize_benchmark [Device=0 blockDimx=2^3 blockDimy=2^5]
Pass: Cold: 0.072318ms GPU, 0.077639ms CPU, 0.50s total GPU, 6928x
Pass: Batch: 0.073379ms GPU, 0.53s total GPU, 7239x
Run:  [8/9] resize_benchmark [Device=0 blockDimx=2^4 blockDimy=2^5]
Pass: Cold: 0.071990ms GPU, 0.077392ms CPU, 0.50s total GPU, 6960x
Pass: Batch: 0.073006ms GPU, 0.54s total GPU, 7342x
Run:  [9/9] resize_benchmark [Device=0 blockDimx=2^5 blockDimy=2^5]
Pass: Cold: 0.088379ms GPU, 0.093626ms CPU, 0.50s total GPU, 5664x
Pass: Batch: 0.089679ms GPU, 0.53s total GPU, 5907x

Benchmark Results

resize_benchmark

[0] NVIDIA GeForce RTX 3090

| blockDimx | blockDimy | NumElements | inDataSize | outDataSize | Samples | CPU Time  | Noise      | GPU Time  | Noise | Elem/s   | GlobalMem BW | BWPeak | Batch GPU | Batch |
|-----------|-----------|-------------|------------|-------------|---------|-----------|------------|-----------|-------|----------|--------------|--------|-----------|-------|
| 2^3 = 8   | 2^3 = 8   | 16070400    | 15.326 MiB | 15.326 MiB  | 6784x   | 79.243 us | 729073.04% | 73.730 us | 6.35% | 217.963G | 435.926 GB/s | 46.57% | 72.141 us | 7448x |
| 2^4 = 16  | 2^3 = 8   | 16070400    | 15.326 MiB | 15.326 MiB  | 7024x   | 76.573 us | 754913.20% | 71.241 us | 1.16% | 225.577G | 451.153 GB/s | 48.20% | 73.041 us | 7342x |
| 2^5 = 32  | 2^3 = 8   | 16070400    | 15.326 MiB | 15.326 MiB  | 7008x   | 76.742 us | 752747.91% | 71.441 us | 0.85% | 224.948G | 449.896 GB/s | 48.06% | 72.574 us | 7342x |
| 2^3 = 8   | 2^4 = 16  | 16070400    | 15.326 MiB | 15.326 MiB  | 7024x   | 76.717 us | 755803.32% | 71.292 us | 0.99% | 225.418G | 450.836 GB/s | 48.16% | 72.263 us | 7342x |
| 2^4 = 16  | 2^4 = 16  | 16070400    | 15.326 MiB | 15.326 MiB  | 6944x   | 77.577 us | 747796.34% | 72.032 us | 0.84% | 223.099G | 446.199 GB/s | 47.67% | 72.866 us | 7342x |
| 2^5 = 32  | 2^4 = 16  | 16070400    | 15.326 MiB | 15.326 MiB  | 6976x   | 77.333 us | 751404.85% | 71.790 us | 0.75% | 223.852G | 447.704 GB/s | 47.83% | 72.963 us | 7248x |
| 2^3 = 8   | 2^5 = 32  | 16070400    | 15.326 MiB | 15.326 MiB  | 6928x   | 77.639 us | 743720.83% | 72.318 us | 1.32% | 222.220G | 444.440 GB/s | 47.48% | 73.379 us | 7239x |
| 2^4 = 16  | 2^5 = 32  | 16070400    | 15.326 MiB | 15.326 MiB  | 6960x   | 77.392 us | 748172.71% | 71.990 us | 0.95% | 223.230G | 446.460 GB/s | 47.69% | 73.006 us | 7342x |
| 2^5 = 32  | 2^5 = 32  | 16070400    | 15.326 MiB | 15.326 MiB  | 5664x   | 93.626 us | 599971.62% | 88.379 us | 0.91% | 181.834G | 363.668 GB/s | 38.85% | 89.679 us | 5907x |

Hi, I am trying to benchmark my kernel, and the output is shown above. My questions are:

1) Does the 'Batch' field have a similar meaning to the 'Samples' field in the isolated measurements? And why do the sample counts vary across blockDimx and blockDimy configurations? Are there any rules for these values?
2) For the isolated measurements, are the values in the Noise field within a valid range? Thank you sincerely.

Dec 06 '23 07:12 thishome

1) Yes, the Batch column is the number of kernel executions used for the batch measurements. The number of samples is determined dynamically from a variety of criteria such as noise, so it will vary with the characteristics of the kernel.
2) This depends on your use case. Most look fairly reasonable to me, though the 6% is pretty high IMO. Locking the GPU clocks is the best way to keep noise values to a minimum.
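One way to see why the sample counts differ is to multiply each mean time by its sample count and compare the result with the "total GPU" figure in the log. The Python sketch below does this for two of the runs above. Reading the result as a roughly fixed time budget is an inference from this particular log, not a documented rule. The byte count used for the BWPeak cross-check (input plus output size, one byte per element) is likewise an assumption that happens to reproduce the table.

```python
# (label, cold mean GPU ms, cold samples, batch mean GPU ms, batch samples),
# copied from runs [1/9] and [9/9] in the log above.
runs = [
    ("8x8", 0.073730, 6784, 0.072141, 7448),
    ("32x32", 0.088379, 5664, 0.089679, 5907),
]

totals = {}
for label, cold_ms, cold_n, batch_ms, batch_n in runs:
    cold_total_s = cold_ms * 1e-3 * cold_n      # mean time x number of samples
    batch_total_s = batch_ms * 1e-3 * batch_n
    totals[label] = (cold_total_s, batch_total_s)
    print(f"{label}: cold {cold_total_s:.2f} s, batch {batch_total_s:.2f} s")

# BWPeak cross-check for the 8x8 row.
# Assumption: bytes moved = inDataSize + outDataSize = 2 x 16070400 bytes.
bytes_moved = 2 * 16_070_400
gpu_time_s = 73.730e-6      # isolated ("Cold") mean GPU time
peak_bw = 936e9             # "Global Memory Bus Peak" from the device section

global_mem_bw = bytes_moved / gpu_time_s
bw_peak_pct = 100.0 * global_mem_bw / peak_bw
print(f"GlobalMem BW = {global_mem_bw / 1e9:.3f} GB/s, BWPeak = {bw_peak_pct:.2f}%")
```

Both runs accumulate about the same total GPU time, so the slower 32x32 configuration ends up with fewer samples.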

Dec 12 '23 18:12 alliepiper
