[FR] Support compare.py benchmarks a.json b.json --benchmark_filter=blah
Is your feature request related to a problem? Please describe.
I've got an a.json and a b.json that took a while to collect due to lots of benchmarks and lots of repetitions. I was surprised that
compare.py benchmarks a.json b.json --benchmark_filter=blah
ignored --benchmark_filter= and always displayed all the results.
Describe the solution you'd like
The "benchmarks" sub-command honors --benchmarks_filter=
and reports as if only those benchmarks were run.
Describe alternatives you've considered
Post-processing the benchmark .json before running compare.py is an inconvenient workaround.
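For context, that post-processing looks roughly like this (a minimal sketch, not part of compare.py; the script name and arguments are made up, and it assumes the library's JSON layout with a top-level "benchmarks" array where each entry has a "name" field):

# filter_json.py - hypothetical helper: keep only benchmarks whose name
# matches a regex, then feed the filtered file to compare.py.
import json
import re
import sys

def filter_benchmarks(in_path, out_path, pattern):
    with open(in_path) as f:
        data = json.load(f)
    regex = re.compile(pattern)
    # "benchmarks" is the top-level array in the library's JSON output;
    # each entry carries the full benchmark name.
    data["benchmarks"] = [b for b in data["benchmarks"] if regex.search(b["name"])]
    with open(out_path, "w") as f:
        json.dump(data, f, indent=2)

if __name__ == "__main__":
    filter_benchmarks(sys.argv[1], sys.argv[2], sys.argv[3])

e.g. python filter_json.py a.json a-filtered.json blah for each input, then compare.py benchmarks a-filtered.json b-filtered.json.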
This is already supported: https://github.com/google/benchmark/blob/6e32352c79dac230861bf601a66fb53815efa864/tools/compare.py#L353-L420
Hmm, I'm having trouble getting a satisfying result using benchmarksfiltered.
I'm trying this kind of thing:
compare.py benchmarksfiltered a.json $REGEX b.json $REGEX
Let's say I want to compare only benchmarks where the last arg is 0..999; say it represents a size. When running a C++ benchmark directly I'd pass --benchmark_filter=/[0-9]{1,3}$. The PR I sent as #1485 allowed me to pass the same regex to compare.py to compare only those benchmarks across multiple runs.
To get behavior comparable to #1485 I have to use a comparatively confusing regex: /[0-9]{1,3}([^0-9]|$). Why? Because the regex is matched against benchmark.name, which in the case of repeated runs may carry suffixes like _stddev. A more explicit regex would be /[0-9]{1,3}($|(_mean|_median|_stddev|_cv)).
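To illustrate why the simpler regex stops matching once aggregate suffixes are appended (a quick sketch using Python's re; the names are taken from the output below):

import re

names = [
    "BM_LowerBound/Ideal_Random/100",         # a per-repetition result
    "BM_LowerBound/Ideal_Random/100_mean",    # aggregates carry a suffix
    "BM_LowerBound/Ideal_Random/100_stddev",
]
simple  = re.compile(r"/[0-9]{1,3}$")           # the filter I'd pass to the benchmark binary
relaxed = re.compile(r"/[0-9]{1,3}([^0-9]|$)")  # the relaxed form needed against benchmark.name

for n in names:
    print(n, bool(simple.search(n)), bool(relaxed.search(n)))
# The simple regex matches only the first name; the relaxed one matches all three.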
The problem with that is the output is ugly:
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)] +0.2571 +0.2571 2 2 2 2
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)] +0.2566 +0.2567 2 2 2 2
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)] -0.0076 -0.0076 3 3 3 3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)] -0.0092 -0.0078 3 3 3 3
From those timings I can guess that the first two results probably come from different parameterizations, but the regex has clobbered that info. The clobbering also causes the reporter to collapse all the aggregate results together, which produces nonsense results.
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]_pvalue 0.6015 0.6015 U Test, Repetitions: 60 vs 60
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]mean +0.2473 +0.2473 2 2 2 2
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]mean -0.0046 -0.0045 3 3 3 3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]mean -0.1241 -0.1241 4 3 4 3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]mean_pvalue 1.0000 1.0000 U Test, Repetitions: 3 vs 3. WARNING: Results unreliable! 9+ repetitions recommended.
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]median +0.2472 +0.2472 2 2 2 2
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]median -0.0066 -0.0066 3 3 3 3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]median -0.1240 -0.1240 4 3 4 3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]median_pvalue 1.0000 1.0000 U Test, Repetitions: 3 vs 3. WARNING: Results unreliable! 9+ repetitions recommended.
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]stddev +2.4894 +2.4941 0 0 0 0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]stddev +4.9339 +5.4636 0 0 0 0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]stddev +3.8236 +3.8372 0 0 0 0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]stddev_pvalue 0.1000 0.1000 U Test, Repetitions: 3 vs 3. WARNING: Results unreliable! 9+ repetitions recommended.
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]cv +1.7976 +1.8013 0 0 0 0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]cv +4.9612 +5.4928 0 0 0 0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]cv +4.5073 +4.5227 0 0 0 0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]cv_pvalue 0.1000 0.1000 U Test, Repetitions: 3 vs 3. WARNING: Results unreliable! 9+ repetitions recommended.
By contrast, the output of my PR #1485 produces sensible results with --benchmark_filter= (example edited for brevity):
Comparing ./results/clang.json to ./results/gcc.json
Benchmark Time CPU Time Old Time New CPU Old CPU New
--------------------------------------------------------------------------------------------------------------------------------------------
BM_LowerBound/Ideal_Random/10 +0.2378 +0.2379 2 2 2 2
BM_LowerBound/Ideal_Random/10 +0.2696 +0.2696 2 2 2 2
BM_LowerBound/Ideal_Random/10 +0.2530 +0.2530 2 2 2 2
BM_LowerBound/Ideal_Random/10_pvalue 0.0000 0.0000 U Test, Repetitions: 20 vs 20
BM_LowerBound/Ideal_Random/10_mean +0.2473 +0.2473 2 2 2 2
BM_LowerBound/Ideal_Random/10_median +0.2472 +0.2472 2 2 2 2
BM_LowerBound/Ideal_Random/10_stddev +2.4894 +2.4941 0 0 0 0
BM_LowerBound/Ideal_Random/10_cv +1.7976 +1.8013 0 0 0 0
BM_LowerBound/Ideal_Random/50 -0.0140 -0.0140 3 3 3 3
BM_LowerBound/Ideal_Random/50 -0.0081 -0.0081 3 3 3 3
BM_LowerBound/Ideal_Random/50 -0.0032 -0.0032 3 3 3 3
BM_LowerBound/Ideal_Random/50_pvalue 0.0071 0.0071 U Test, Repetitions: 20 vs 20
BM_LowerBound/Ideal_Random/50_mean -0.0046 -0.0045 3 3 3 3
BM_LowerBound/Ideal_Random/50_median -0.0066 -0.0066 3 3 3 3
BM_LowerBound/Ideal_Random/50_stddev +4.9339 +5.4636 0 0 0 0
BM_LowerBound/Ideal_Random/50_cv +4.9612 +5.4928 0 0 0 0
BM_LowerBound/Ideal_Random/100 -0.1096 -0.1096 4 4 4 4
BM_LowerBound/Ideal_Random/100 -0.1310 -0.1310 4 3 4 3
BM_LowerBound/Ideal_Random/100 -0.1237 -0.1237 4 3 4 3
BM_LowerBound/Ideal_Random/100_pvalue 0.0000 0.0000 U Test, Repetitions: 20 vs 20
BM_LowerBound/Ideal_Random/100_mean -0.1241 -0.1241 4 3 4 3
BM_LowerBound/Ideal_Random/100_median -0.1240 -0.1240 4 3 4 3
BM_LowerBound/Ideal_Random/100_stddev +3.8236 +3.8372 0 0 0 0
BM_LowerBound/Ideal_Random/100_cv +4.5073 +4.5227 0 0 0 0
OVERALL_GEOMEAN -0.2366 -0.2366 0 0 0 0
I'll play around with benchmarksfiltered to see if I can make it do sensible things when both regexes are equal.
Hmm...am I missing something? I can't figure out how to re-open the issue myself.
I see, so the feedback is that those two "filter" concepts are being conflated, and they actually do different things.
Yes, I think so. The benchmarksfiltered filters are intended to collapse matches into a single "family", whereas --benchmark_filter= merely filters the benchmarks but does not change their names.
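Roughly speaking, the difference looks like this (a conceptual sketch, not compare.py's actual implementation; the "[matched]" label is a stand-in for the "regex vs. regex" label seen in the output above):

import re

names = ["BM_LowerBound/Ideal_Random/10_mean",
         "BM_LowerBound/Ideal_Random/100_mean"]
pattern = re.compile(r"/[0-9]{1,3}($|_)")
label = "[matched]"  # placeholder for the label compare.py prints

# benchmarksfiltered-style "collapse into a family": the matched text is
# replaced by a common label, so /10 and /100 become indistinguishable.
print([pattern.sub(label + r"\1", n) for n in names])
# ['BM_LowerBound/Ideal_Random[matched]_mean', 'BM_LowerBound/Ideal_Random[matched]_mean']

# --benchmark_filter-style "filter only" (#1485): names are left untouched and
# non-matching benchmarks are simply dropped.
print([n for n in names if pattern.search(n)])
# ['BM_LowerBound/Ideal_Random/10_mean', 'BM_LowerBound/Ideal_Random/100_mean']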
I thought about changing benchmarksfiltered to go into a different "filter but do not collapse" mode when the regexes used for both files are equal. But perhaps there are useful use cases for the collapsing behavior?
I will re-send my original PR.
One nice thing about supporting --benchmark_filter is that you can do this transformation with no change in behavior:
compare.py orig.json candidate-binary --benchmark_filter $REGEX
to
compare.py orig.json candidate.json --benchmark_filter $REGEX