nvbench Add best practices documentation

This PR adds a best practices guide for NVBench, providing code examples to help users quickly get started and conduct effective performance comparisons in real-world scenarios.

Sep 19 '25 20:09 PointKernel

@PointKernel I like the document as a friendly "getting started guide".

I would expect the best practices guide to provide answers to:

nvbench vs. NCU
should I or should I not lock GPU frequency
should I use a single kernel in the launchable lambda per benchmark, or can I use multiple kernels
how to do performance tuning, with a kernel example and reasoning for arriving at an optimal choice of parameters

So perhaps the document could be renamed, otherwise looks good to me.

Oct 02 '25 21:10 oleksandr-pavlyk

Regarding the comparison between nvbench and NCU, could you elaborate a bit on what kind of information would be most useful for users? From my perspective, nvbench primarily gathers runtime data and basic metrics, which feels more similar to NSYS, whereas NCU provides deeper kernel-level profiling with detailed hardware utilization insights.

Yes, I should have been more clear. What I had in mind was the difference in kernel runtime estimate by NCU and by nvbench. I was just bitten by a discrepancy in timings caused by NCU locking GPU frequency while timing, and nvbench not doing that. I needed to use --clock-control none option on NCU to get timings measured by NCU to agree with those reported by nvbench.

While NCU may have other reasons to lock frequency (to get sampling-based estimates, such as stall rates, more accurate), GPU Mode talk https://www.youtube.com/watch?v=CtrqBmYtSEk by @gevtushenko makes the point that timings obtained with locked frequency may not be representative of real-world kernel performance.

Oct 02 '25 21:10 PointKernel