
added INSTRUCTIONS and CYCLES hardware perf counters on MacOS.

Open robwyatt opened this issue 2 years ago • 5 comments

MacOS has a semi-undocumented API in libpthread that returns the instruction and cycle counts of the calling thread. The API is thread_selfcounts, and it is available on both Intel and Apple Silicon hardware. This PR uses that API to integrate cleanly with the existing performance counter code and enables those two counters.

For example, you can use --benchmark_perf_counters=INSTRUCTIONS,CYCLES just as you can on Linux, but no external dependencies are required because everything needed is in libpthread.

The performance counter documentation has been updated to reflect this change.
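For readers unfamiliar with the call, here is a minimal sketch of reading the two counters around a region of code. It is not the code from this PR. The prototype of thread_selfcounts is declared by hand because no public header provides it, and the argument value 1 and the buffer layout (instructions first, then cycles) are assumptions that should be checked against the PR itself. ThreadCounts, ReadThreadCounts, Delta and MeasureLoop are hypothetical names used only for illustration. On other platforms the reader reports that no data is available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__APPLE__)
// Assumed prototype: the symbol is exported by the system libraries but is
// not declared in a public SDK header, so it is declared by hand here.
extern "C" int thread_selfcounts(int type, void* buf, size_t nbytes);
#endif

// Hypothetical holder for one snapshot (or one difference) of the counters.
struct ThreadCounts {
  uint64_t instructions = 0;
  uint64_t cycles = 0;
  bool valid = false;  // false when the counters could not be read
};

// Snapshot the calling thread's counters.
ThreadCounts ReadThreadCounts() {
  ThreadCounts out;
#if defined(__APPLE__)
  uint64_t raw[2] = {0, 0};
  // Assumption: type 1 fills two 64-bit values, instructions then cycles.
  if (thread_selfcounts(1, raw, sizeof(raw)) == 0) {
    out.instructions = raw[0];
    out.cycles = raw[1];
    out.valid = true;
  }
#endif
  return out;
}

// Difference between two snapshots taken on the same thread.
ThreadCounts Delta(const ThreadCounts& begin, const ThreadCounts& end) {
  ThreadCounts d;
  d.valid = begin.valid && end.valid;
  if (d.valid) {
    d.instructions = end.instructions - begin.instructions;
    d.cycles = end.cycles - begin.cycles;
  }
  return d;
}

// Measure a trivial loop; the volatile sink keeps the loop from being removed.
ThreadCounts MeasureLoop(int iterations) {
  volatile uint64_t sink = 0;
  const ThreadCounts begin = ReadThreadCounts();
  for (int i = 0; i < iterations; ++i) {
    sink = sink + static_cast<uint64_t>(i);
  }
  const ThreadCounts end = ReadThreadCounts();
  return Delta(begin, end);
}
```

Because the counters are per thread, both snapshots have to be taken on the thread being measured. That is the same begin/end pattern a benchmark harness would apply around each timed region.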

robwyatt avatar Jun 02 '22 04:06 robwyatt

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-cla[bot] avatar Jun 02 '22 04:06 google-cla[bot]

Shouldn't this go into perfmon itself?

LebedevRI avatar Jun 02 '22 13:06 LebedevRI

I looked at that, but it would be a hell of an effort to get perfmon to build on MacOS and support the whole API. The main difficulty is that perfmon uses read() with a driver-provided fd to get its data. That is not how MacOS works, and it isn't really possible from user mode.

It also seems like software engineering overkill to make a whole library to wrap a single API that is built into the OS. The OS already does a lot of work to make those two counters behave the same on both platforms: the counters are thread specific, and context switching and processor migration are handled, so the counters are rock solid. This PR just calls that API where the perfmon version calls read(). To me it seems like a fair change: it has zero overhead and zero additional dependencies, and ultimately it's a 50-line addition with no existing code changed. I understand if you don't agree, but I'd like to hear your thoughts on how to get around the file descriptor/read() problem.

Now, if we had easy access to the whole set of counters, then the full library would make sense, but the read() problem is not trivial even from kernel mode. Getting access to the hardware counters on MacOS is a challenge and is completely different on Intel and Arm, so it would take two completely independent implementations. On top of that, kernel drivers are tricky, especially on Apple Silicon. On both platforms Apple assume they have sole access to the counters, and writing a kernel driver would break things like Instruments and the existing profiling tools.

Sorry for the long reply, but I did look at all the options and I'm pretty sure this is the best solution.

robwyatt avatar Jun 02 '22 15:06 robwyatt

Please check the clang-format results (and any other build failures) :)

dmah42 avatar Jun 06 '22 09:06 dmah42

@robwyatt did you intend to get back to this?

dmah42 avatar Feb 07 '23 13:02 dmah42