Enable CUPTI to measure kernel execution time instead of CUDA Events

Open fbusato opened this issue 1 year ago • 1 comment

CUDA events suffer from low accuracy and include the kernel launch overhead. CUPTI, on the other hand, provides a more reliable way to get consistent timing measurements. This request asks for an option to replace CUDA Events with CUPTI.

Details

CUDA events issues:

Accuracy and Stability:
- cudaEvent timings can fluctuate by 10-30us, making measurements of small computations unreliable
- cudaEvent timings include the kernel launch overhead, which depends on host/CPU execution and/or the driver version
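
For readers less familiar with event-based timing, the pattern under discussion can be sketched roughly as below. This is a minimal illustration, not the benchmark library's actual code: the kernel `scale` and the helper `time_with_events` are hypothetical names introduced only for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being benchmarked.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: returns the elapsed milliseconds between two events
// that bracket a single launch of `scale`, or -1.0f if a CUDA call fails.
float time_with_events(int n) {
    float* d_data = nullptr;
    float ms = -1.0f;
    if (cudaMalloc((void**)&d_data, n * sizeof(float)) != cudaSuccess) return ms;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // The measured interval covers everything the stream does between the
    // two records. If the GPU reaches `start` before the host has finished
    // submitting the kernel, that submission latency is part of the result.
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

Calling `time_with_events(1 << 20)` several times in a row and comparing the returned values is a simple way to observe the run-to-run variation described above.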

CUPTI:

- Granularity: ~0.5us vs. 10-30us with cudaEvent
- Not affected by kernel launch overhead
- Consistency: measurements close to the profiler (nsys)
- Efficiency: avoids using waiting/delay kernels to hide CPU overhead
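
For context, the waiting/delay-kernel technique mentioned in the last point can be sketched as follows. This is an assumption-laden illustration of the general idea, not the project's implementation: `wait_for_host`, `work`, and `time_with_blocking_kernel` are hypothetical names, and a production version would likely also need a timeout to guard against deadlocks.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being benchmarked.
__global__ void work(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Spins on the GPU until the host writes a non-zero value to *flag.
__global__ void wait_for_host(volatile int* flag) {
    while (*flag == 0) { }
}

// Hypothetical helper: queues the start event, the kernel, and the stop
// event behind a spinning kernel, then releases the stream. Returns the
// elapsed milliseconds, or -1.0f if a CUDA call fails.
float time_with_blocking_kernel(int n) {
    float* d_data = nullptr;
    int* h_flag = nullptr;  // pinned host memory, mapped into the device
    int* d_flag = nullptr;  // device-side pointer to the same flag
    float ms = -1.0f;

    if (cudaMalloc((void**)&d_data, n * sizeof(float)) != cudaSuccess) return ms;
    if (cudaHostAlloc((void**)&h_flag, sizeof(int), cudaHostAllocMapped) != cudaSuccess) {
        cudaFree(d_data);
        return ms;
    }
    *h_flag = 0;
    cudaHostGetDevicePointer((void**)&d_flag, h_flag, 0);
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Everything after the spinning kernel is enqueued while the stream is
    // blocked, so host-side submission cost is paid before timing begins.
    wait_for_host<<<1, 1>>>(d_flag);
    cudaEventRecord(start);
    work<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);

    // Release the stream, then wait for the timed region to finish.
    *(volatile int*)h_flag = 1;
    cudaEventSynchronize(stop);

    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_flag);
    cudaFree(d_data);
    return ms;
}
```

The extra spinning kernel and host flag are the cost this approach pays to hide CPU overhead; timestamps collected by CUPTI on the device would make them unnecessary.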

Aug 29 '24 22:08 fbusato

We do mitigate a lot of the issues with events by using blocking_kernels, so it's not quite as bad as it seems. I think this would be a great addition. I'm curious how much it would improve the stability of our results, especially when sync tags are used.

Sep 04 '24 15:09 alliepiper