tpp-mlir

Add support in the `perf` dialect to start/stop perf counters when we start/stop the timer

Open rengolin opened this issue 2 years ago • 3 comments

This will improve our benchmark strategy and should be a good chunk of work that we can upstream.

We also need to track them somehow. Today we have mean and stdev for the timer, but this now makes it harder as we'll have multiple values tracked.

Do we want to track mean and stdev for each perf counter too? If so, then we need to rename perf.mean to perf.timer_mean etc. If not, then do we just dump total sum and divide by %n at the end?
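To make the trade-off in that question concrete, here is a minimal Python sketch contrasting the two options. It is only an illustration: the numbers are made up, the variable names are hypothetical, and it assumes a sample standard deviation (n - 1), which may not match what the existing timer does.

```python
import statistics

# Hypothetical per-run measurements for one counter (illustrative numbers only).
samples = [100.0, 110.0, 90.0, 100.0]

# Option A: keep every run, so both mean and stdev are available.
mean_a = statistics.mean(samples)
stdev_a = statistics.stdev(samples)  # sample standard deviation (n - 1)

# Option B: keep only a running sum and divide by the run count at the end.
total = 0.0
for value in samples:
    total += value
mean_b = total / len(samples)
```

Both options agree on the mean, but only option A retains enough information to report a spread, at the cost of storing one value per run for every counter.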

What counters do we add? We can start with basics (cache, TLB, branch), but people may want more stuff. Do we allow custom counters?

Do we add them all on perf.start_timer, and then allow users to print them individually? Do we allow users to enable different counters (perf.start_cache_miss), and if so, what's the result of perf.show_counter for something that hasn't been enabled?

To begin with, I'd propose a simple implementation:

  • Rename perf.start_timer to perf.start and perf.stop_timer to perf.stop and add all basic counters, including timer.
  • Keep a list of each of these counters, per run, on the runtime (like we do for timer).
  • Enable everything on perf.start and disable everything on perf.stop and append the tracked values to the runtime arrays.
  • Rename perf.mean and perf.stdev to perf.timer, returning two values (mean, stdev).
  • Create similar accessors for the other counters (ex. perf.cache_miss), also returning two values each.
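The bookkeeping described in the list above could look roughly like the following Python sketch. This is not the tpp-mlir runtime: `PerfRuntime`, `stats`, and the reader callbacks are hypothetical names, and the cache-miss "counter" is a fake stand-in rather than a real hardware counter.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple


class PerfRuntime:
    """Hypothetical runtime-side bookkeeping; not the tpp-mlir implementation."""

    def __init__(self, readers: Dict[str, Callable[[], float]]):
        # Each reader returns the current cumulative value of one counter.
        self._readers = dict(readers)
        self._begin: Dict[str, float] = {}
        self._runs: Dict[str, List[float]] = {name: [] for name in self._readers}

    def start(self) -> None:
        # Mirrors the proposed perf.start: snapshot every counter at once.
        self._begin = {name: read() for name, read in self._readers.items()}

    def stop(self) -> None:
        # Mirrors the proposed perf.stop: append one delta per counter.
        for name, read in self._readers.items():
            self._runs[name].append(read() - self._begin[name])

    def stats(self, name: str) -> Tuple[float, float]:
        # Mirrors accessors such as perf.timer: returns (mean, stdev).
        values = self._runs[name]
        mean = statistics.mean(values)
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        return mean, stdev


# Stand-in for a hardware counter: a fake cumulative cache-miss count.
fake_misses = iter([0, 10, 10, 30])
runtime = PerfRuntime({
    "timer": time.perf_counter,
    "cache_miss": lambda: float(next(fake_misses)),
})

for _ in range(2):
    runtime.start()
    runtime.stop()

cache_mean, cache_stdev = runtime.stats("cache_miss")
timer_mean, timer_stdev = runtime.stats("timer")
```

Keeping one list per counter means every accessor can share the same (mean, stdev) code path, which is the main appeal of starting and stopping everything together.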

@adam-smnk @chelini

rengolin, Nov 17 '23 15:11

IMHO perf counters other than time often don't make much sense on a start/stop basis. E.g., cache misses in our applications are normally bursty (we load a matrix panel that we then reuse), so the misses are not a problem as long as we get the reuse. We might also prefer parallelizations that incur more cache misses in exchange for better performance (when we are not bandwidth-bound), so a lower cache-miss rate is not always better.

Perf counters work best when sampled over a timeline. So I suggest integration with the VTune API for sampling... but first we need to see whether we really need this.
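A small Python sketch of the difference between a start/stop aggregate and a sampled timeline, under loudly stated assumptions: the per-interval numbers are invented for illustration, the burst threshold is arbitrary, and nothing here uses a real profiling API.

```python
# Hypothetical cache-miss counts per sampling interval (illustrative numbers only):
# a burst while a matrix panel is loaded, then reuse with almost no misses.
timeline = [900, 950, 5, 3, 4, 2, 6, 5]

# Start/stop view: a single aggregate for the whole region.
total_misses = sum(timeline)

# Sampled view: per-interval values expose where the misses occur.
burst_threshold = 100  # arbitrary cut-off chosen for this sketch
burst_intervals = [i for i, misses in enumerate(timeline) if misses > burst_threshold]
burst_share = sum(timeline[i] for i in burst_intervals) / total_misses
```

The aggregate alone cannot distinguish this pattern from misses spread evenly across the run, whereas the timeline shows that nearly all of them fall in the first two intervals.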

alheinecke, Nov 19 '23 22:11

> So I suggest integration with Vtune API for sampling...

Side note: VTune would not be upstream-friendly.

rengolin, Nov 21 '23 13:11

> Side note: VTune would not be upstream-friendly.

Ideally perf API would be generic enough to allow downstream bindings like this.
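One way to read "generic enough" is a small backend interface that the upstream layer depends on, with vendor-specific collectors registered downstream. The Python sketch below is purely illustrative: `CounterBackend`, `register_backend`, `measure`, and the toy backend are all hypothetical names, not part of any existing API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class CounterBackend(ABC):
    """Hypothetical interface; all names here are assumptions for illustration."""

    @abstractmethod
    def start(self) -> None:
        """Begin collecting."""

    @abstractmethod
    def stop(self) -> Dict[str, float]:
        """Stop collecting and return values keyed by counter name."""


_BACKENDS: Dict[str, CounterBackend] = {}


def register_backend(name: str, backend: CounterBackend) -> None:
    # Downstream projects would call this to plug in their own collector.
    _BACKENDS[name] = backend


def measure(name: str, workload: Callable[[], object]) -> Dict[str, float]:
    # The generic layer only sees the interface, never the vendor tool.
    backend = _BACKENDS[name]
    backend.start()
    workload()
    return backend.stop()


class CallCountBackend(CounterBackend):
    """Toy downstream backend that just counts start/stop pairs."""

    def __init__(self) -> None:
        self.calls = 0

    def start(self) -> None:
        self.calls += 1

    def stop(self) -> Dict[str, float]:
        return {"calls": float(self.calls)}


register_backend("toy", CallCountBackend())
result = measure("toy", lambda: sum(range(1000)))
```

With this split, the upstream code would carry only the interface and a default backend, and a sampling-based collector could live entirely in a downstream binding.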

adam-smnk, Nov 21 '23 13:11