nvbench

Enable CUPTI to measure kernel execution time instead of CUDA Events

Open • fbusato opened this issue 1 year ago • 1 comment

CUDA events have low accuracy, and the time they report includes the kernel launch overhead. CUPTI, by contrast, provides more reliable and consistent timing measurements. This issue requests an option to measure kernel execution time with CUPTI instead of CUDA events.

Details

CUDA events issues:

  • Accuracy and Stability:
    • cudaEvent timings can fluctuate by 10-30us, making measurements of small computations unreliable
    • cudaEvent timings include the kernel launch overhead, which depends on host/CPU execution and/or the driver version

CUPTI:

  • Granularity: ~0.5us, vs. the 10-30us fluctuation of CUDA events
  • Not affected by kernel launch overhead
  • Consistency: measurements are close to those reported by the profiler (nsys)
  • Efficiency: no need for waiting/delay kernels to hide CPU overhead
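
To put these numbers in perspective, here is a rough back-of-the-envelope sketch in C++. It expresses the timer noise as a fraction of the kernel duration. The 20us kernel duration is an assumed value picked only to represent a "small computation", and `relative_uncertainty` is a hypothetical helper, not part of nvbench, CUDA, or CUPTI. Fluctuation and granularity are not strictly the same quantity, so treat this as an order-of-magnitude comparison only.

```cpp
// Hypothetical helper for illustration; not part of nvbench, CUDA, or CUPTI.
// Returns the timer noise as a fraction of the measured kernel duration.
constexpr double relative_uncertainty(double noise_us, double kernel_us)
{
  return noise_us / kernel_us;
}

// Assumed kernel duration, chosen only to illustrate a "small computation".
constexpr double kernel_us = 20.0;

// Figures quoted in this issue: 10-30us fluctuation for CUDA events,
// ~0.5us granularity for CUPTI.
constexpr double events_low  = relative_uncertainty(10.0, kernel_us); // 0.5   (50%)
constexpr double events_high = relative_uncertainty(30.0, kernel_us); // 1.5   (150%)
constexpr double cupti       = relative_uncertainty(0.5, kernel_us);  // 0.025 (2.5%)
```

Under these assumptions, the event-based noise is comparable to or larger than the kernel being measured, while the CUPTI granularity amounts to a few percent of it.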

fbusato • Aug 29 '24 22:08

We do mitigate a lot of the issues with events by using blocking_kernels, so it's not quite as bad as it seems. That said, I think this would be a great addition. I'm curious how much it would improve the stability of our results, especially when sync tags are used.

alliepiper • Sep 04 '24 15:09