benchmark
[FR] Support Dynamic PMU detection
Currently, on modern HW where multiple PMU counters can be recorded in a single run (example: Icelake with 16 concurrent PMU counters), the code in perf_counters.cc
hard-codes a global limit of 3 counters.
I'd like to use libpfm's internal API to detect at runtime the PMU that each requested counter is associated with,
and internally track how many counters are "consumed" from each PMU given the information retrieved from calling
pfm_get_pmu_info()
instead of the current hard-coded limit of 3 built into the code.
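To make the "consumed per PMU" idea concrete, here is a minimal sketch of the bookkeeping, not the actual PR code. `PmuCapacity` and `PmuBudget` are hypothetical names; the assumption is that the capacities would be filled from the `num_cntrs` and `num_fixed_cntrs` fields that `pfm_get_pmu_info()` reports, but the sketch takes them as plain inputs so it does not depend on libpfm.

```cpp
#include <map>
#include <string>

// Hypothetical capacity record for one PMU. Assumption: in real code these
// numbers would come from the num_cntrs / num_fixed_cntrs fields of the
// pfm_pmu_info_t filled in by pfm_get_pmu_info().
struct PmuCapacity {
  int generic;  // general-purpose counters
  int fixed;    // fixed-function counters
};

// Hypothetical tracker: remembers how many counters each PMU offers and how
// many have already been handed out to requested events.
class PmuBudget {
 public:
  void AddPmu(const std::string& pmu, PmuCapacity cap) { capacity_[pmu] = cap; }

  // Tries to reserve one counter on `pmu`. Returns false if the PMU is
  // unknown or the matching (fixed or generic) pool is exhausted.
  bool TryConsume(const std::string& pmu, bool is_fixed) {
    auto it = capacity_.find(pmu);
    if (it == capacity_.end()) return false;
    // operator[] value-initializes a missing entry, so usage starts at zero.
    PmuCapacity& used = used_[pmu];
    int& count = is_fixed ? used.fixed : used.generic;
    const int limit = is_fixed ? it->second.fixed : it->second.generic;
    if (count >= limit) return false;
    ++count;
    return true;
  }

 private:
  std::map<std::string, PmuCapacity> capacity_;
  std::map<std::string, PmuCapacity> used_;
};
```

With this shape, a request is rejected only when the specific PMU it maps to has run out of counters, instead of when a global count reaches 3.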
I am opening this issue in preparation for a PR that would implement this logic, and wanted to see if it needs more discussion / blessing before I submit it. I have already started some preliminary work on tracking the requested counters against the availability of each PMU.
I think @mtrofin was looking at something similar...
OK, let me know if there is already something in progress. I think I might be able to get something into PR form by the weekend, if you guys like it?
I was looking to build internal switching, i.e. assuming the limit is N, but the user wants P = kN+r counters, allow them to specify P counters and then, internally, execute the workload k+1 times.
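A rough sketch of the splitting step, assuming counters are identified by name and that any grouping of at most N per run is acceptable. `SplitIntoRuns` is a hypothetical helper, not something in the code base. Note that it yields ceil(P / N) groups, which is k+1 when r > 0 and k when r = 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: partitions the requested counters into consecutive
// groups of at most `limit` entries, one group per execution of the workload.
std::vector<std::vector<std::string>> SplitIntoRuns(
    const std::vector<std::string>& counters, std::size_t limit) {
  std::vector<std::vector<std::string>> runs;
  if (limit == 0) return runs;  // nothing can be scheduled
  for (std::size_t i = 0; i < counters.size(); i += limit) {
    const std::size_t end = std::min(i + limit, counters.size());
    // Copy counters[i, end) into a new group.
    runs.emplace_back(counters.begin() + i, counters.begin() + end);
  }
  return runs;
}
```

For example, 7 requested counters with a limit of 3 would be measured over three executions, with groups of 3, 3, and 1.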
I believe this FR is orthogonal. One recommendation: please ensure the storage in PerfCounterValues is still inlined, to avoid risk of additional cache misses.
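To illustrate what "inlined" means here, a minimal sketch follows. `InlineCounterValues` is a hypothetical stand-in, not the real `PerfCounterValues`; the assumption is that a compile-time upper bound on the number of counters is acceptable even once the per-PMU limit is detected at runtime.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a counter-value container with inline storage:
// the values live inside the object itself (std::array), so reading them
// never chases a pointer to a separate heap allocation.
template <std::size_t kCapacity>
class InlineCounterValues {
 public:
  // `n` is the number of counters actually in use; it is clamped to the
  // compile-time capacity in this sketch.
  explicit InlineCounterValues(std::size_t n)
      : size_(n <= kCapacity ? n : kCapacity) {}

  std::size_t size() const { return size_; }
  std::uint64_t& operator[](std::size_t i) { return values_[i]; }
  std::uint64_t operator[](std::size_t i) const { return values_[i]; }

 private:
  std::array<std::uint64_t, kCapacity> values_{};  // zero-initialized
  std::size_t size_;
};
```

The trade-off is that the capacity must be chosen large enough for the biggest PMU the code expects to see, while the runtime-detected count only decides how many of the slots are used.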
@mtrofin thanks for pinging back. Yeah, your suggestion to re-execute the workload until all the requested counters have been collected is orthogonal and yet connected (in the sense that both would change the existing code base).
Is this something you have started work on, or are you just planning it at this point? I already have some basic code that tracks the counters and aggregates them into per-PMU and fixed/non-fixed counter "counts" for the "budgeting" aspects of this. I wouldn't mind rewriting it if you have a more mature branch.
I don't have anything done.