criterion.rs
Allow for Post-benchmarking Summarization
Criterion has been a delight to use so far in my chess engine project, pleco. I'm currently in the process of transitioning to Criterion from the standard library benchmarking.
One thing I think Criterion is missing is a way to display results concisely. The standard library benchmarks simply output one number per benchmarked function, making it very easy to see the exact time for each benchmark. Criterion spits out large blobs of text, and while the output is very informative, it's hard to read through quickly.
Perhaps an (optional) post-benchmarking summary could be of use. After every benchmark has been run, Criterion could print a summary of each benchmark's result, similar to the standard library.
Hey, thanks for the suggestion. I'm glad you like Criterion.
The existing output format is intended to be easily readable already, but I agree it could do that better.
As a work-around to this, I've been using something like this:
$ cargo bench -- --verbose | tee bench.log | rg 'time:|thrpt:'
that pretty closely captures what the standard library's benchmark harness will show you. It does use two lines instead of one, but it has much less noise and still saves the more informative output to disk for closer scrutiny.
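As a further tweak (an untested sketch, plain POSIX awk), you can fold each thrpt: line onto the preceding time: line so that every benchmark really does occupy a single line:
$ cargo bench -- --verbose | tee bench.log | rg 'time:|thrpt:' \
    | awk '/time:/ { if (line != "") print line; line = $0; next }
           { sub(/^ +/, ""); line = line "   " $0 }
           END { if (line != "") print line }'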
I'm curious - which parts of the output do you see as noise and why? The outlier information is not often useful, I suppose. I rely on the percentage-changed and regressed/unchanged/improved display pretty heavily, but you've filtered those out in your example.
@bheisler Great question! So, firstly, I want to say that "noise" was probably a poor word choice on my part. I actually very much like all of the information printed by criterion. It's extremely useful context, and it's why I run with --verbose to get the extra bits about standard deviation, which is also really useful.
However, sometimes I have a lot of benchmarks. For the project I'm working on right now, I count 69 of them, and I actually suspect that number will grow. Benchmarks can grow rapidly because of combinatorial factors in the thing you're trying to measure (input size, algorithm). So basically, when I do a benchmark run, what I'd like to be able to do is see a dense bird's-eye view of what happened. Ideally, each benchmark would occupy about one line. If I could pick precisely the information I'd want on each line, I think I'd choose something like this (a rough mock-up follows the list):
- Benchmark name (including group name).
- Average time to run.
- Average throughput.
- Standard deviation.
- If comparing against a baseline, then percent change in average time.
- Color could be used to indicate whether criterion detects a statistically significant improvement (green) or regression (red).
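Purely to illustrate the layout (the names and numbers below are made up, and color obviously doesn't show in plain text), each line might look roughly like:
group/bench_a    time: 1.2034 ms   thrpt: 812.34 Kelem/s   stddev: 0.8%   change: -3.1% (improved)
group/bench_b    time: 28.441 ms   thrpt: 34.52 Kelem/s    stddev: 1.1%   change: +0.4% (no change)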
That way, I could very quickly scan all of the benchmarks to get a general idea of what happened. If there was no percent change, then I probably don't need to investigate too much more. But if there is, then I can go look at the report criterion currently generates and/or the graphs, which are really really useful.
Hopefully that clarifies things! I'm happy to answer more questions.
I also would like to be able to emit a "summarized" output like cargo bench does, or maybe even something in between that and cargo benchcmp (e.g. when comparing against a baseline). Ideally with some CLI options to control what to plot (e.g. I often only care about how fast the fastest invocation was, instead of being interested in the mean).
Would also like this. One note is that I've been using something akin to @BurntSushi's
$ cargo bench -- --verbose | tee bench.log | rg 'time:|thrpt:'
but if the benchmark name is too long, it's no longer on the same line as time:.
In terms of motivation - one use for me is to look at a pattern over a set of parameterized benchmarks - i.e. something like "at what vector size does this operation fall out of cache and slow down", and it's easiest to do that when there's just one vertical column of numbers to compare. (without delving into plots and so on)
I forgot to update this issue, but I've mostly addressed my desire here with a tool that reads criterion's output: https://github.com/BurntSushi/critcmp
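For anyone else who lands here, the basic workflow is roughly the following (double-check the critcmp README for current details; --save-baseline is Criterion's own flag for naming a saved run):
$ cargo install critcmp
$ cargo bench -- --save-baseline before
# ...make your changes...
$ cargo bench -- --save-baseline after
$ critcmp before after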
@bheisler I have noticed that criterion provides a Report trait. Should we add a method for adding extra reports to the reports list? Maybe simply:
// Sketch only: assumes Criterion holds a `reports: Reports` field and
// Reports wraps a Vec<Box<dyn Report>> internally.
impl Criterion {
    pub fn add_report(&mut self, report: Box<dyn Report>) {
        self.reports.push(report);
    }
}

impl Reports {
    fn push(&mut self, report: Box<dyn Report>) {
        self.reports.push(report);
    }
}
My only doubt is whether the Report trait is stable. But I don't think that concern outweighs the benefits this method would provide.
The Report trait is not stable, no. The main thing holding this up is the design work necessary to define a trait that makes sense for reports.
Just adding in a shell script I use that may be helpful to others. It just wraps the output of cargo bench, but names the file based on git describe (e.g., v2.1.2_2022-08-23_1647.bench) and adds CPU info, which is helpful if you need to send/archive benchmarks that may have been run on different machines.
#!/bin/sh
# need path-safe UTC datetime
dtime=$(date +"%Y-%m-%d_%H%M" --utc)
describe=$(git describe --always --tags)
fname="benches/results/${describe}_${dtime}.bench"
# Build a command that prints run metadata (date, commit, rustc version, CPU info) and then runs the benchmarks
cmd="echo Benchmark from $dtime on commit $describe;"
cmd=${cmd}"rustc --version;"
cmd=${cmd}"printf '\n';"
cmd=${cmd}"echo CPU information:;"
cmd=${cmd}"lscpu | grep -E 'Architecture|Model name|Socket|Thread|CPU\(s\)|MHz';"
cmd=${cmd}"printf '\n\n\n';"
cmd=${cmd}"cargo bench $*;"
eval "$cmd" | tee "$fname"