uarch-bench
Allow dynamically loaded benchmarks from a shared object
Rather than compiling in benchmarks, it would be cool to allow benchmarks to be dynamically loaded from a shared object, allowing decoupling of the benchmark application and default benchmarks from other benchmarks.
This would need at least the following:
- A mechanism to load shared objects (e.g., dlopen/dlsym and friends) and enumerate the contained benchmarks.
- An API that shared benchmark objects are written against, which would be a subset of the existing uarch-bench code.
- Some kind of versioning mechanism so that we don't blow up when we load benchmarks compiled against an older version of the uarch-bench API after a breaking change has been made.
I'd suggest sticking to C here for more flexibility, plus not having to deal with C++ name mangling.
That said, how much API do you need? AFAICT all you would need is something like
#include <stdint.h>
#include <stddef.h>
typedef struct UarchGroup_ UarchGroup;

/* Set a human-readable description for the group. */
void uarch_group_set_description(UarchGroup* group, const char* description);

/* Register one benchmark: a stable id, a display name, the function to time,
   and the number of operations per iteration of its inner loop. */
void uarch_group_add_bench(UarchGroup* group,
                           const char* id,
                           const char* name,
                           long (*func)(uint64_t iterations),
                           size_t ops_per_loop);
or am I missing something?
Then you could just have a single public symbol (uarch_register_benches or something) which would accept a UarchGroup* parameter, so a benchmark group (shared library) would just be something like:
#if defined(__cplusplus)
extern "C"
#endif
void uarch_register_benches(UarchGroup* group) {
    uarch_group_set_description(group, "Population Count");
    if (__builtin_cpu_supports("popcnt"))
        uarch_group_add_bench(group, "popcount-native", "POPCNT", psnip_popcount_builtin_native, 1);
    uarch_group_add_bench(group, "popcount-builtin", "__builtin_popcount", psnip_popcount_builtin, 1);
    uarch_group_add_bench(group, "popcount-table", "Table-based popcount", psnip_popcount_table, 1);
    uarch_group_add_bench(group, "popcount-twiddle", "Bit-twiddling popcount", psnip_popcount_twiddle, 1);
}
The API should be simple enough that versioning shouldn't really be an issue, but if it is you can always switch to something like uarch2_register_benches. You could also preserve compatibility by adding something like uarch_group_add_bench_with_foo(..., Foo foo).
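To make that concrete, here is a rough sketch of the forwarding approach, reusing the Foo placeholder from the sentence above (none of these names exist in uarch-bench today, and FOO_DEFAULT just stands in for whatever the old behaviour was):

/* Sketch only: the extended variant takes the new Foo parameter, and the
   original function keeps working by forwarding a default value to it, so
   plugins built against the old API continue to load and run. */
void uarch_group_add_bench_with_foo(UarchGroup* group, const char* id, const char* name,
                                    long (*func)(uint64_t iterations), size_t ops_per_loop,
                                    Foo foo);

void uarch_group_add_bench(UarchGroup* group, const char* id, const char* name,
                           long (*func)(uint64_t iterations), size_t ops_per_loop) {
    uarch_group_add_bench_with_foo(group, id, name, func, ops_per_loop, FOO_DEFAULT);
}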
As for enumeration, all you need to do is run through all shared libraries in a directory. If they have a uarch_register_benches symbol you run it, otherwise you dlclose() it (and maybe emit a warning).
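A minimal sketch of that directory scan, assuming POSIX dlopen/dlsym and dirent (the function name load_bench_libraries is purely illustrative):

#include <dirent.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: scan `dir` for .so files, dlopen each one, and call its
   uarch_register_benches entry point if present; otherwise warn and dlclose.
   UarchGroup is the opaque type declared above. */
static void load_bench_libraries(const char* dir, UarchGroup* group) {
    DIR* d = opendir(dir);
    if (!d)
        return;
    struct dirent* entry;
    while ((entry = readdir(d)) != NULL) {
        if (!strstr(entry->d_name, ".so"))
            continue;
        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir, entry->d_name);
        void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (!handle)
            continue;
        void (*reg)(UarchGroup*) =
            (void (*)(UarchGroup*)) dlsym(handle, "uarch_register_benches");
        if (reg) {
            reg(group);  /* keep the library loaded: its benchmark functions run later */
        } else {
            fprintf(stderr, "warning: %s has no uarch_register_benches\n", path);
            dlclose(handle);
        }
    }
    closedir(d);
}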
@nemequ - yeah, using a C API is a good idea to avoid the versioning and other pitfalls of a C++ API.
Yeah I think the API could look something like that. The problem is that even those components have changed a few times already, so I'm a bit reluctant to lock things down with an API - although I suppose I could say that we just don't support backwards compatibility.
The other problem though is how the benchmarks are actually generated. See a typical benchmark file like default_benches.cpp and the ALL_TIMERS_X stuff. It's an X-macro that generates the actual benchmark method for each timer (i.e., the method is generated uniquely for each TIMER - of which there are currently only two, the clock-based one and the libpfc one). This template stuff takes care of generating the sampling code, the normalization code, and all the other stuff as well.
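For readers who haven't seen the pattern, here is a generic sketch of the X-macro idea (made-up names, not the actual ALL_TIMERS_X or timer types from uarch-bench, which use C++ templates): the timer list is written once and one runner is stamped out per timer, so the timing calls and the benchmark body can be inlined into each copy.

#include <stdint.h>

/* All names below are placeholders for illustration only. */
static void ClockTimer_start(void)  {}
static void ClockTimer_stop(void)   {}
static void LibpfcTimer_start(void) {}
static void LibpfcTimer_stop(void)  {}
static long my_bench(uint64_t iters) { return (long)iters; }

/* List every timer exactly once... */
#define ALL_TIMERS_X(X) \
    X(ClockTimer)       \
    X(LibpfcTimer)

/* ...and generate one benchmark runner per timer. */
#define MAKE_RUNNER(TIMER)               \
    long run_##TIMER(uint64_t iters) {   \
        TIMER##_start();                 \
        long result = my_bench(iters);   \
        TIMER##_stop();                  \
        return result;                   \
    }

ALL_TIMERS_X(MAKE_RUNNER)  /* defines run_ClockTimer and run_LibpfcTimer */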
So a C API that just passes a long (* func)(uint64_t iterations) wouldn't be exactly equivalent. You could probably still use almost the same scaffolding, but then just call the method via function pointer rather than inlining the function (or function call) directly into the benchmark method. That's not too terrible I guess.
> Yeah I think the API could look something like that. The problem is that even those components have changed a few times already, so I'm a bit reluctant to lock things down with an API - although I suppose I could say that we just don't support backwards compatibility.
I haven't dug into those macros yet, but if you can hide all that stuff you'll have a much easier time of maintaining the API even as the underlying implementation changes. I guess you structured it that way to avoid the overhead of an extra function call?
Yeah the idea is that with the template-based generation mechanism, everything can potentially be inlined into the innermost benchmark and we can avoid the overhead of function calls, or any other junk that might leak into the "measured region".
In practice, I've written most of my benchmarks in assembly separately compiled by nasm, so that always implies a function call anyways, so the savings aren't as big as for a benchmark written in C++ that can be completely inlined (in that case though you have to be careful to "sink" the result so the optimizer doesn't defeat your benchmark).
So I think in practice the API could just make an indirect call (i.e., a call through a function pointer), which isn't much worse than the non-indirect call we are making today for asm benchmarks. Yes, you'll suffer a branch misprediction at least the first time through, but we do a few warmup runs to stabilize these effects. We also try to remove the overheads through delta measurements, which is a whole separate and interesting topic.
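As a rough illustration of the delta idea (all names here are made up, not uarch-bench's actual code): the same scaffolding times a do-nothing benchmark to estimate the fixed overhead, which is then subtracted from the real measurement.

#include <stdint.h>

/* Illustration only: estimate the fixed overhead (call, timer reads, loop
   bookkeeping) by measuring an empty benchmark, then subtract it from the
   real result, clamping at zero in case of measurement noise. */
static uint64_t delta_measure(uint64_t (*measure)(long (*)(uint64_t), uint64_t),
                              long (*empty_bench)(uint64_t),
                              long (*real_bench)(uint64_t),
                              uint64_t iters) {
    uint64_t overhead = measure(empty_bench, iters);
    uint64_t raw      = measure(real_bench,  iters);
    return raw > overhead ? raw - overhead : 0;
}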
Right now I have a new "mode", not yet committed, called "one shot", which is very different from the existing strategy of (1) doing warmup iterations and (2) generally measuring with many iterations and taking the min/median of the results. Instead, one shot calls the function exactly once in the measured region (and it may not have any iterations internally), the times/counters are reported, and this can be repeated a few times. So it can capture cold effects, you can see how the 1st try differs from the second, and you can potentially measure transient effects. I'm using this to actually dig more into some CPU uarch details. I mention it mostly because here I'm actually finding that the function call to asm code is problematic, so I introduced another mode where you can inline the measurement code directly in the asm code. That approach is one that could work for benchmarks in a shared object: make the person writing the benchmark inline the measurement code.