nvbench
Enable CUPTI to measure kernel execution time instead of CUDA Events
Timings taken with CUDA events suffer from low accuracy and include the kernel launch overhead. CUPTI, on the other hand, provides a more reliable way to get consistent timing measurements. This request asks for an option to replace CUDA events with CUPTI.
Details
CUDA events issues:
- Accuracy and Stability:
  - `cudaEvent` timings can fluctuate in the range of 10-30us, making measurements of small computations unreliable
  - `cudaEvent` timings take into account the kernel launch overhead, which depends on host/CPU execution and/or driver version

CUPTI:
- ~0.5us granularity vs. 10-30us
- Not affected by kernel launch overhead
- Consistency: measurements close to the profiler (nsys)
- Efficiency: avoid using waiting/delay kernels to hide CPU overhead
We do mitigate a lot of the issues with events by using blocking_kernels, so it's not quite as bad as it seems. I think this would be a great addition, and I'm curious how much it would improve the stability of our results, especially when sync tags are used.