clvk
clvk copied to clipboard
Add timing.hpp
Approach
A single timer is a static storage duration object with a name/description, accumulates timings over the course of the application run, and prints out the total and average times at exit. Utility macros are provided to easily create scoped timers (e.g. for a function or block) and unscoped timers (manual start/stop).
A new CMake option, CLVK_ENABLE_TIMING
, is used to turn on internal timing (off by default). This means that timers can be safely left in the codebase for things that frequently need to be profiled, without introducing overheads for builds that run without timing enabled.
Timing output is printed with cvk_warn
so that it can be used at CLVK_LOG<=2
(higher levels tend to introduce noise to performance due to excessive logging).
I (very roughly) measured overheads of about 200ns per timed region on an Android device. This is small enough that I haven't seen measurement noise for any of what I've used this for so far. I did try using this to time refcounted::{retain,release}
and it did move the needle there, but not by a huge amount.
Usage
Just #include "timing.hpp"
in a file that needs timing, and then use the macros as follows:
Time a function: (uses the function name as the timer description)
void cvk_program::do_build() {
CVK_TIMED_FUNCTION;
// function body
}
Time a block:
if (do_work) {
CVK_TIMED_BLOCK(foo_timer, "doing foo");
// do foo
}
Unscoped timing:
CVK_UNSCOPED_TIMER(foo_timer, "doing foo");
for (...) {
// do stuff
foo_timer.start();
// do foo
foo_timer.stop();
// do more stuff
}
// Can also start the timer at the same time as definition for convenience:
CVK_UNSCOPED_TIMER(bar_timer, "doing bar").start();
// do bar
bar_timer.stop();
Example output
Timing some API calls and some cvk_program::do_build()
internals currently produces output like this:
[CLVK] 229.69 ms -> clReleaseProgram (1632 blocks, avg 0.141 ms)
[CLVK] 334.05 ms -> clEnqueueNDRangeKernel (4820 blocks, avg 0.069 ms)
[CLVK] 0.28 ms -> clGetKernelWorkGroupInfo (2838 blocks, avg 0.000 ms)
[CLVK] 290.23 ms -> clCreateKernel (1419 blocks, avg 0.205 ms)
[CLVK] 149.62 ms -> stripping reflection info (213 blocks, avg 0.702 ms)
[CLVK] 433.32 ms -> SPIR-V validation (213 blocks, avg 2.034 ms)
[CLVK] 595.00 ms -> do_build (213 blocks, avg 2.793 ms)
[CLVK] 654.32 ms -> clBuildProgram (213 blocks, avg 3.072 ms)
Future improvements?
Happy to discuss/change the output format, names of macros, or any other aspect of this. No rush to review/land either, since it's easy to keep as a local patch.
It'd be nice to sort the output or at least group timers by compilation unit (e.g. all OpenCL APIs together), but I haven't come up with a nice way to do this yet (and it isn't that important really). Even better would be track timing hierarchies so that we can do inclusive/exclusive times (e.g. do_build
includes some of the other stuff in the above output), but this might make things too heavyweight (and again isn't that important).