tpp-mlir
Add support in the `perf` dialect to start/stop perf counters when we start/stop the timer
This will improve our benchmark strategy and should be a good chunk of work that we can upstream.
We also need to track the counter values somehow. Today we have mean and stdev for the timer, but counters make this harder, as we'll be tracking multiple values per run.
Do we want to track mean and stdev for each perf counter too? If so, then we need to rename perf.mean to perf.timer_mean etc. If not, then do we just dump total sum and divide by %n at the end?
What counters do we add? We can start with basics (cache, TLB, branch), but people may want more stuff. Do we allow custom counters?
Do we add them all on perf.start_timer, and then allow users to print them individually? Do we allow users to enable different counters (perf.start_cache_miss), and if so, what's the result of perf.show_counter for something that hasn't been enabled?
To begin with, I'd propose a simple implementation:
- Rename `perf.start_timer` to `perf.start` and `perf.stop_timer` to `perf.stop`, and add all basic counters, including the timer.
- Keep a list of each of these counters, per run, on the runtime (like we do for the timer).
- Enable everything on `perf.start`, disable everything on `perf.stop`, and append the tracked values to the runtime arrays.
- Rename `perf.mean` and `perf.stdev` to `perf.timer`, returning two values (mean, stdev).
- Create similar accessors for the other counters (ex. `perf.cache_miss`), also returning two values each.
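
A minimal sketch of the runtime-side bookkeeping this proposal implies, written in Python only to illustrate the idea (the actual runtime is not Python). `PerfRuntime`, `make_fake_counter`, and the reader callbacks are hypothetical names, not existing tpp-mlir APIs. Real hardware counters would need an OS-level backend, so a scripted stub stands in for one here.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple


class PerfRuntime:
    """Hypothetical bookkeeping behind perf.start / perf.stop (illustration only)."""

    def __init__(self, readers: Dict[str, Callable[[], float]]):
        # Each reader returns the current cumulative value of one counter.
        self._readers = readers
        self._begin: Dict[str, float] = {}
        # One list of per-run deltas for each counter, timer included.
        self.samples: Dict[str, List[float]] = {name: [] for name in readers}

    def start(self) -> None:
        # perf.start: snapshot every registered counter.
        self._begin = {name: read() for name, read in self._readers.items()}

    def stop(self) -> None:
        # perf.stop: append this run's delta for every counter.
        for name, read in self._readers.items():
            self.samples[name].append(read() - self._begin[name])

    def stats(self, name: str) -> Tuple[float, float]:
        # Accessor in the spirit of perf.timer / perf.cache_miss: (mean, stdev).
        values = self.samples[name]
        mean = statistics.fmean(values)
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        return mean, stdev


def make_fake_counter(readings: List[float]) -> Callable[[], float]:
    # Stub counter that replays scripted cumulative readings (assumption:
    # stands in for a real hardware counter backend).
    it = iter(readings)
    return lambda: next(it)


# Usage: the timer is just another counter in the same table.
runtime = PerfRuntime({"timer": time.perf_counter})
for _ in range(3):
    runtime.start()
    sum(range(10_000))  # placeholder workload
    runtime.stop()
timer_mean, timer_stdev = runtime.stats("timer")
```

Keeping the per-run samples is what makes a stdev possible for every counter. Dumping only a total and dividing by `%n` would give the mean but nothing else.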
@adam-smnk @chelini
IMHO perf counters other than time often don't make much sense on a start/stop basis. E.g., cache misses in our applications are normally bursty (we load a matrix panel and then reuse it), so the misses are not a problem as long as we get the reuse. We might also prefer parallelizations that incur more cache misses but perform better (when we are not BW bound), so a lower cache miss rate is not always better.
Perf counters work best when sampled over a timeline. So I suggest integration with the VTune API for sampling... but first we need to see whether we really need this.
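
To illustrate the difference between a start/stop total and a sampled timeline, here is a small Python sketch. The function names are hypothetical and the readings are made up purely for illustration (they are not measurements). They mimic a burst of misses while a panel is loaded, followed by reuse.

```python
from typing import List, Sequence


def interval_deltas(cumulative: Sequence[int]) -> List[int]:
    """Turn cumulative counter readings into per-interval deltas."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def start_stop_total(cumulative: Sequence[int]) -> int:
    """What a single start/stop pair would report: last reading minus first."""
    return cumulative[-1] - cumulative[0]


# Made-up cumulative cache-miss readings taken at fixed intervals.
readings = [0, 900, 1000, 1000, 1000, 1000]

timeline = interval_deltas(readings)   # shows where in time the misses happen
total = start_stop_total(readings)     # a single number, burst shape is lost
```

The total alone cannot distinguish a short burst followed by reuse from a steady stream of misses, which is the case sampling is meant to expose.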
> So I suggest integration with VTune API for sampling...
Side note: VTune would not be upstream-friendly.
> Side note: VTune would not be upstream-friendly.
Ideally perf API would be generic enough to allow downstream bindings like this.