BenchmarkTools.jl
[RFC/WIP] Tools for measuring cycles and cpu_times and tricking out LLVM
I recently started exploring options for more precise and low-level benchmarking tools.
As it is, this PR is not ready to be included in BenchmarkTools, but it should provide a starting point for discussion.
1. clobber() and escape(): Two methods to prevent certain compiler optimisations at the LLVM level (see https://youtu.be/nXaxk27zwlk?t=2441). clobber() is a compiler-level memory barrier that forces the compiler to flush all writes to memory, and escape() is a method that prevents LLVM from optimising a value away, since we are faking a store of it. escape() is not quite done, since it can't handle boxed values, and it would be easier to write if we could depend on LLVM.jl.
2. bench_start() and bench_end(): Inspired by https://github.com/dterei/gotsc and https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html. Since CPUs can do speculative execution, reordering, and a bunch of other shenanigans, this is a very careful series of instructions that tries to prevent as much of that as possible and thus should give as precise an estimate as possible of the number of cycles it takes for a block of code to run. These instructions are not completely noise-free, since we are still running in user space, and the current implementation is x86_64 only (and requires a series of processor features). It is also tricky to convert cycles to time spent. If we use this method it should be opt-in, and we need to measure variance and overhead.
3. getProcessTime() and getThreadTime(): I got curious and looked into what google/benchmark uses for time measurement, and it turns out they actually measure two things: run time and CPU time, where the latter is the time the process actually spends running. The current implementation is Linux only but can be extended to all platforms we care about. For run time measurement they use http://en.cppreference.com/w/cpp/chrono/high_resolution_clock. Currently we are using uv_hrtime from libuv. Both uv_hrtime and the C++ timer will fall back to clock_gettime(CLOCK_MONOTONIC, ...) under Unix, similar to my implementation of getProcessTime.
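The compiler barrier from the first item can be sketched in a few lines. This is a minimal sketch, not the code from this PR: it assumes Base.llvmcall is used with LLVM inline assembly carrying a memory clobber, the name clobber is taken from the description above, and fill_with_barrier! is a hypothetical helper used only for illustration.

```julia
# Sketch of a compiler-level memory barrier (assumption: not the PR's actual code).
# The empty inline-asm statement emits no machine instructions, but the
# "~{memory}" clobber tells LLVM that memory may have been read or written,
# so pending stores cannot be dropped or moved across this point.
@inline function clobber()
    Base.llvmcall("""
        call void asm sideeffect "", "~{memory}"()
        ret void
        """, Cvoid, Tuple{})
end

# Hypothetical helper: write into a buffer, then force the compiler to treat
# those writes as observable before returning.
function fill_with_barrier!(buf::Vector{Int})
    for i in eachindex(buf)
        buf[i] = i
    end
    clobber()
    return buf
end

fill_with_barrier!(zeros(Int, 4))
```

Note that this only constrains the compiler; it does not emit a hardware fence, so it says nothing about how the CPU itself orders the stores.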
What should we do?
I think taking a lead from google/benchmark and also measuring CPU time, not just run time, would be a good first actionable item. I am much less sure about what to do with 1. and 2. and whether they are useful for BenchmarkTools.jl; that needs further evaluation, and I currently don't have time for it.
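Measuring CPU time alongside run time, as suggested above, might look like the following on Linux. This is a minimal sketch under stated assumptions, not the implementation of getProcessTime() from this PR: TimeSpec, clock_ns and wall_and_cpu_ns are hypothetical names, the clock IDs are the Linux values, and the struct layout assumes a 64-bit Linux where time_t is a C long.

```julia
# Linux-only sketch: the clock IDs below are the Linux values from <time.h>.
const CLOCK_MONOTONIC          = Cint(1)
const CLOCK_PROCESS_CPUTIME_ID = Cint(2)
const CLOCK_THREAD_CPUTIME_ID  = Cint(3)

# Mirrors C's `struct timespec` (assumption: 64-bit Linux, time_t == long).
struct TimeSpec
    tv_sec::Clong
    tv_nsec::Clong
end

# Hypothetical helper: read a POSIX clock and return its value in nanoseconds.
function clock_ns(clockid::Cint)
    ts = Ref(TimeSpec(0, 0))
    rc = ccall(:clock_gettime, Cint, (Cint, Ref{TimeSpec}), clockid, ts)
    rc == 0 || error("clock_gettime failed for clock id $clockid")
    return ts[].tv_sec * 1_000_000_000 + ts[].tv_nsec
end

# Hypothetical helper: run `f` once and report wall time and process CPU time.
function wall_and_cpu_ns(f)
    wall0 = clock_ns(CLOCK_MONOTONIC)
    cpu0  = clock_ns(CLOCK_PROCESS_CPUTIME_ID)
    f()
    cpu1  = clock_ns(CLOCK_PROCESS_CPUTIME_ID)
    wall1 = clock_ns(CLOCK_MONOTONIC)
    return (wall = wall1 - wall0, cpu = cpu1 - cpu0)
end

# Sleeping consumes wall time but typically very little CPU time.
t = wall_and_cpu_ns(() -> sleep(0.1))
println("wall: ", t.wall, " ns, cpu: ", t.cpu, " ns")
```

Swapping CLOCK_PROCESS_CPUTIME_ID for CLOCK_THREAD_CPUTIME_ID would give a per-thread figure instead, which is presumably the distinction between getProcessTime() and getThreadTime().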
It is also tricky to convert cycles to time spent. If we use this method it should be opt-in, and we need to measure variance and overhead.
Cycles spent is an extremely relevant metric in itself, often far more relevant than time. So I'd say measure and report both, as well as the implied measured frequency. This can serve as a reality check for users (if the reported frequency differs a lot from the official frequency, then we probably have a lot of measurement error). Also, when interpreting results, every relevant resource is normally counted in clock cycles anyway (instruction costs, cache-miss penalties, memory fetches, branch mispredicts, etc). Say you do some computation with N logical steps; then you always want to count operations per cycle, which tells you roughly how good your code is (large number: few bookkeeping instructions, good use of memory and ILP; small number: there is a problem to track down).
Converting cycles to nanoseconds is bad; if any conversion makes sense, then it is nanoseconds -> cycles. By reporting measured frequency, the user is also empowered to spot problems like frequency drop due to AVX2, etc (some CPUs scale down frequency when some vector instructions are used).
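The two derived quantities mentioned above, implied frequency and operations per cycle, are simple ratios. The following is a small sketch with made-up numbers; CycleSample, implied_ghz, ops_per_cycle and suspicious are hypothetical names and are not part of BenchmarkTools, and the 10% tolerance is an arbitrary choice for illustration.

```julia
# Hypothetical container for one measurement; not a BenchmarkTools type.
struct CycleSample
    cycles::Int        # elapsed CPU cycles for the measured region
    nanoseconds::Int   # elapsed wall time for the same region
    ops::Int           # logical operations performed (the "N" in the text)
end

# Implied clock frequency in GHz: cycles per nanosecond.
implied_ghz(s::CycleSample) = s.cycles / s.nanoseconds

# Throughput in logical operations per cycle.
ops_per_cycle(s::CycleSample) = s.ops / s.cycles

# Flag a sample whose implied frequency deviates from the nominal frequency
# by more than the relative tolerance `tol`.
function suspicious(s::CycleSample, nominal_ghz::Real; tol::Real = 0.10)
    return abs(implied_ghz(s) - nominal_ghz) / nominal_ghz > tol
end

# Made-up numbers for illustration only.
s = CycleSample(3_000_000, 1_000_000, 6_000_000)
println(implied_ghz(s))       # 3.0
println(ops_per_cycle(s))     # 2.0
println(suspicious(s, 3.0))   # false
println(suspicious(s, 3.6))   # true
```

A sample flagged this way would point either to measurement error or to a real frequency change, such as the vector-instruction downclocking mentioned above.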
Do you know of any way to measure cycles in a platform-portable way, e.g. something that works for ARM and PPC?
Originally I went forward with https://github.com/JuliaCI/BenchmarkTools.jl/pull/94 since CPU time is an important measure as well (how much time did we actually spend in the program, as opposed to sleeping or in the kernel). I agree that cycle benchmarking has its place and is an important tool, but I am not convinced that a general framework such as BenchmarkTools is the right place for it (maybe we need a LowlevelBenchmarkTools package), since when measuring cycles you want to tightly control the code executed before and after the region of interest, and any of that introduces overhead that will throw off the other timing measurements.
Anyway, I won't have time to work on either, so I would be happy if someone could pick this up and bring it to a conclusion.
So one of the things that has brought me back to this PR is that https://perf.rust-lang.org/ defaults to instructions and cycles, as does http://llvm-compile-time-tracker.com/.
But maybe the better pathway is to use LinuxPerf.jl to build that infrastructure.