uarch-bench
Support for non-x86 architectures
Seems like this would be pretty difficult, but I'd love to have something like this working on other architectures, especially ARM.
Indeed, it's "on the roadmap" so to speak. The idea is that most of the code should be as portable as reasonable, with only the required amount of assembly. I'm also planning "C" benchmarks for various things: some things are only reasonable in assembly, but many can be done in C too, making them automatic on other archs.
I do have an Android phone, so probably I can help do this as well. Right now I'm working on the x86 performance counter support, however, and this is almost done.
If you want, I can give you an SSH account on a Raspberry Pi 3 I have sitting around for development, running Fedora 26.
OK, I may take you up on that offer if it's still open when I get to this!
I just came across __builtin_readcyclecounter, which I didn't know existed, though maybe you did. I know uarch-bench does a lot more than just a rdtsc, so I'm not sure if it's usable or not, but I thought I'd mention it in case it is.
Apparently it's been around for a while (at least since clang 3.4, didn't bother checking past that), though it doesn't work everywhere. According to a SO answer it does work on AArch64…
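For reference, a minimal sketch of how the builtin could be wrapped so the same source still builds on compilers that lack it. The helper name read_ticks and the std::chrono fallback are illustrative assumptions, not anything in uarch-bench, and on some targets the underlying counter may not be readable from user mode unless the OS enables it.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// __has_builtin is provided by clang and by newer gcc; define a fallback so
// the feature check below compiles on compilers that do not have it.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical helper (not part of uarch-bench): returns a tick count from the
// compiler builtin when it exists, otherwise nanoseconds from steady_clock.
// The two sources use different units, so only differences taken from the
// same build are meaningful.
inline std::uint64_t read_ticks() {
#if __has_builtin(__builtin_readcyclecounter)
    return __builtin_readcyclecounter();
#else
    const auto now = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(now).count());
#endif
}
```

Typical use would be to call read_ticks() before and after the code under test and take the difference.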
@nemequ - thanks for the note, I didn't know about it! Some thoughts (probably most of this is not news to you, but it's helpful for me to write it anyways):
On x86 it uses rdtsc, which makes the name a little bit wrong: it's counting wall-clock time, not cycle time[0]. I don't actually use rdtsc directly at all in uarch-bench at the moment: if you use the default timer, it just uses std::chrono::high_resolution_clock::now(), which pretty much directly calls clock_gettime() on Linux, which in turn is implemented in the VDSO as a usermode call to rdtsc and some adjustment. So I am kind of using rdtsc, but in an indirect way (AFAIK the overhead is perhaps 2x a raw rdtsc call, with the benefit that I'm using a portable C++ implementation). The way the tests and scaffolding are written, we usually do several loops, and also try to subtract out the clock overhead, so the absolute overhead itself isn't a problem: stability is more important (that said, no doubt a raw rdtsc call will be more stable as well: many fewer sources of variance).
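To make the "several loops, subtract the clock overhead" idea concrete, here is a rough sketch built on std::chrono::high_resolution_clock. It is not the actual uarch-bench scaffolding: the names clock_overhead_ns and measure_ns, the use of the minimum as the estimator, and the sample counts are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

using Clock = std::chrono::high_resolution_clock;

// Nanoseconds between two time points (may be negative if the clock is
// adjusted, since high_resolution_clock is not guaranteed to be steady).
inline std::int64_t elapsed_ns(Clock::time_point start, Clock::time_point stop) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Hypothetical helper: estimates the cost of a back-to-back pair of now()
// calls as the minimum over many samples, clamped at zero.
inline std::int64_t clock_overhead_ns(int samples = 1000) {
    assert(samples > 0);
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < samples; ++i) {
        const auto t0 = Clock::now();
        const auto t1 = Clock::now();
        best = std::min(best, elapsed_ns(t0, t1));
    }
    return std::max<std::int64_t>(0, best);
}

// Hypothetical helper: times fn() over several repetitions, keeps the fastest
// run, subtracts the estimated clock overhead, and clamps the result at zero.
template <typename Fn>
std::int64_t measure_ns(Fn&& fn, int reps = 10) {
    assert(reps > 0);
    const std::int64_t overhead = clock_overhead_ns();
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < reps; ++i) {
        const auto t0 = Clock::now();
        fn();
        const auto t1 = Clock::now();
        best = std::min(best, elapsed_ns(t0, t1));
    }
    return std::max<std::int64_t>(0, best - overhead);
}
```

Taking the minimum rather than the mean is one way to favor stability over absolute accuracy, since interrupts and other noise only ever make a sample slower.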
Unfortunately, godbolt doesn't seem to have any clang-ARM targets (it does have gcc-ARM, but gcc doesn't support this builtin).
All that to say that if the compiler in question on ARM implements high_resolution_clock::now() in a similarly efficient way, then the default timer should more or less just already work with reasonable performance[1]. Still on both x86 and ARM it's probably worth adding a mode that uses rdtsc directly (via this builtin or inline asm) to reduce the variance.
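A possible shape for such a "raw" mode, sketched with an intrinsic on x86 and inline asm on AArch64. This is an assumption-laden illustration rather than planned uarch-bench code: the name raw_timestamp is made up, and on AArch64 it reads CNTVCT_EL0 (the fixed-frequency virtual counter, which Linux normally lets user code read) as the nearest analogue of rdtsc, so like rdtsc it measures time rather than core cycles.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc, available with GCC and clang
#endif

// Hypothetical helper: reads a raw timestamp with as little software in the
// way as possible, falling back to steady_clock on other targets.
inline std::uint64_t raw_timestamp() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#elif defined(__aarch64__)
    std::uint64_t ticks;
    // Virtual counter register; assumed to be readable from user mode.
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    const auto now = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(now).count());
#endif
}
```

Neither read is serializing, so instructions can be reordered around it; a real timer would need to decide whether fences are worth their cost.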
Now the more interesting timer is the --timer=libpfc one, which gives you access to the PMU and is what I usually use. Not only does it often give you cycle-accurate measurements (at least in some modes and some types of benchmark), but you can add other interesting events and have them displayed alongside the cycle results. To get that to work on ARM we'd need a library like libpfc that gives access to the performance counters. I know they exist on ARM, but I don't know, for example, if there is a "user mode" instruction to read them. The existence of that on x86 (note: the OS needs to give you permission to use it) is what makes the cycle-accurate timings possible.
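One architecture-neutral way to reach the PMU on Linux, shown here only as a point of comparison, is the kernel's perf_event_open system call. This is a different technique from libpfc: every start, stop and read goes through a syscall, so it will not match the low overhead of a user-mode counter read. The helper name count_user_cycles is hypothetical, and the call can fail when the kernel or a sandbox denies access, in which case the sketch returns -1.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__linux__)
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#endif

// Hypothetical helper: counts user-mode CPU cycles spent in fn() on the
// calling thread via perf_event_open. Returns -1 if the counter cannot be
// opened or read (no PMU, insufficient permissions, or a non-Linux OS).
template <typename Fn>
std::int64_t count_user_cycles(Fn&& fn) {
#if defined(__linux__)
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;  // count user-mode cycles only
    attr.exclude_hv = 1;

    // pid = 0 and cpu = -1: measure this thread on whichever CPU it runs on.
    const int fd = static_cast<int>(
        syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
    if (fd < 0) {
        return -1;
    }

    std::uint64_t count = 0;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    const bool ok =
        read(fd, &count, sizeof(count)) == static_cast<ssize_t>(sizeof(count));
    close(fd);
    return ok ? static_cast<std::int64_t>(count) : -1;
#else
    (void)fn;
    return -1;
#endif
}
```

Other events can be counted the same way by changing attr.config, which is the syscall-based counterpart of displaying extra events alongside the cycle results.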
[0] Except on a small slice of decade-old CPUs around the time frequency scaling was becoming popular where rdtsc briefly counted in cycles even though the frequency could vary. That made it suck for implementing time APIs, which are much more popular than cycle APIs, so Intel changed it. Notably, it's still implemented under the covers as a true cycle (not time) counter, with adjustment logic to scale the count based on the current frequency, which makes it slower than a raw cycle counter. Unfortunately, there is no instruction to access the raw cycle counter, even though it exists!
[1] Of course there is the small problem that nearly all of the benchmarks themselves are written not in C/C++ but in x86 asm, so those naturally won't work on ARM. Still, it would be easy to port most of the simple benchmarks over, and the idea is to have C++ versions of the ones that can be expressed without asm.