
[FR] Support compare.py benchmarks a.json b.json --benchmark_filter=blah

Open matta opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe.

I've got an a.json and a b.json that took a while to collect due to lots of benchmarks and lots of repetitions. I was surprised that

compare.py benchmarks a.json b.json --benchmark_filter=blah

ignored --benchmark_filter= and always displayed all the results.

Describe the solution you'd like

The "benchmarks" sub-command honors --benchmarks_filter= and reports as if only those benchmarks were run.

Describe alternatives you've considered

Post-processing the benchmark .json before running compare.py is an inconvenient workaround.
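For concreteness, the post-processing workaround looks roughly like this. A minimal sketch, assuming the standard Google Benchmark JSON layout (a top-level "benchmarks" array whose entries carry a "name" field); the script name, file names, and regex are placeholders:

import json
import re
import sys

# filter_bench_json.py: keep only benchmarks whose name matches a regex.
def filter_json(path_in, path_out, pattern):
    with open(path_in) as f:
        data = json.load(f)
    regex = re.compile(pattern)
    # The JSON reporter lists every result, including the _mean/_median/
    # _stddev/_cv aggregate rows, in the "benchmarks" array.
    data["benchmarks"] = [b for b in data["benchmarks"] if regex.search(b["name"])]
    with open(path_out, "w") as f:
        json.dump(data, f, indent=2)

if __name__ == "__main__":
    filter_json(sys.argv[1], sys.argv[2], sys.argv[3])

Used as, e.g., python filter_bench_json.py a.json a.filtered.json '/[0-9]{1,3}($|_)' (and likewise for b.json), before handing the filtered files to compare.py benchmarks.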

matta · Sep 08 '22 19:09

This is already supported: https://github.com/google/benchmark/blob/6e32352c79dac230861bf601a66fb53815efa864/tools/compare.py#L353-L420

LebedevRI · Sep 08 '22 19:09

Hmm, I'm having trouble getting a satisfying result using benchmarksfiltered.

I'm trying this kind of thing:

compare.py benchmarksfiltered a.json $REGEX b.json $REGEX

Let's say I want to compare only the benchmarks whose last arg is 0..999; say that arg represents a size. When running a C++ benchmark binary directly I'd pass --benchmark_filter=/[0-9]{1,3}$. The PR I sent as #1485 allowed me to pass the same regex to compare.py to compare only those benchmarks across multiple runs.

To get behavior comparable to #1485 I have to use a comparatively confusing regex: /[0-9]{1,3}([^0-9]|$). Why? Because the regex is matched against benchmark.name, which in the case of repeated runs may carry aggregate suffixes like _stddev. A more explicit regex would be /[0-9]{1,3}($|(_mean|_median|_stddev|_cv)).
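To make the name-matching issue concrete, here is a tiny sketch using names taken from the output below (illustrative only):

import re

names = [
    "BM_LowerBound/Ideal_Random/10",
    "BM_LowerBound/Ideal_Random/10_mean",
    "BM_LowerBound/Ideal_Random/10_stddev",
]

# The regex that works for --benchmark_filter on the binary...
plain = re.compile(r"/[0-9]{1,3}$")
# ...versus the more explicit one needed when matching benchmark.name.
explicit = re.compile(r"/[0-9]{1,3}($|_mean|_median|_stddev|_cv)")

for name in names:
    print(name, bool(plain.search(name)), bool(explicit.search(name)))
# .../10         True   True
# .../10_mean    False  True
# .../10_stddev  False  True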

The problem with that is the output is ugly:

BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]                          +0.2571         +0.2571             2             2             2             2
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]                          +0.2566         +0.2567             2             2             2             2
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]                          -0.0076         -0.0076             3             3             3             3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]                          -0.0092         -0.0078             3             3             3             3

From those timings I can guess that the first two results are probably run with different parameterizations, but the regex has clobbered that info. This clobbering also causes the reporter to collapse all the aggregate results together, which is a nonsense result.

BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]_pvalue                    0.6015          0.6015      U Test, Repetitions: 60 vs 60
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]mean                      +0.2473         +0.2473             2             2             2             2
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]mean                      -0.0046         -0.0045             3             3             3             3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]mean                      -0.1241         -0.1241             4             3             4             3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]mean_pvalue                1.0000          1.0000      U Test, Repetitions: 3 vs 3. WARNING: Results unreliable! 9+ repetitions recommended.
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]median                    +0.2472         +0.2472             2             2             2             2
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]median                    -0.0066         -0.0066             3             3             3             3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]median                    -0.1240         -0.1240             4             3             4             3
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]median_pvalue              1.0000          1.0000      U Test, Repetitions: 3 vs 3. WARNING: Results unreliable! 9+ repetitions recommended.
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]stddev                    +2.4894         +2.4941             0             0             0             0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]stddev                    +4.9339         +5.4636             0             0             0             0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]stddev                    +3.8236         +3.8372             0             0             0             0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]stddev_pvalue              0.1000          0.1000      U Test, Repetitions: 3 vs 3. WARNING: Results unreliable! 9+ repetitions recommended.
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]cv                        +1.7976         +1.8013             0             0             0             0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]cv                        +4.9612         +5.4928             0             0             0             0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]cv                        +4.5073         +4.5227             0             0             0             0
BM_LowerBound/Ideal_Random[/[0-9]{1,3}($|_) vs. /[0-9]{1,3}($|_)]cv_pvalue                  0.1000          0.1000      U Test, Repetitions: 3 vs 3. WARNING: Results unreliable! 9+ repetitions recommended.

By contrast, the output of my PR #1485 produces sensible results with --benchmark_filter= (example edited for brevity):

Comparing ./results/clang.json to ./results/gcc.json
Benchmark                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------
BM_LowerBound/Ideal_Random/10                                +0.2378         +0.2379             2             2             2             2
BM_LowerBound/Ideal_Random/10                                +0.2696         +0.2696             2             2             2             2
BM_LowerBound/Ideal_Random/10                                +0.2530         +0.2530             2             2             2             2
BM_LowerBound/Ideal_Random/10_pvalue                          0.0000          0.0000      U Test, Repetitions: 20 vs 20
BM_LowerBound/Ideal_Random/10_mean                           +0.2473         +0.2473             2             2             2             2
BM_LowerBound/Ideal_Random/10_median                         +0.2472         +0.2472             2             2             2             2
BM_LowerBound/Ideal_Random/10_stddev                         +2.4894         +2.4941             0             0             0             0
BM_LowerBound/Ideal_Random/10_cv                             +1.7976         +1.8013             0             0             0             0
BM_LowerBound/Ideal_Random/50                                -0.0140         -0.0140             3             3             3             3
BM_LowerBound/Ideal_Random/50                                -0.0081         -0.0081             3             3             3             3
BM_LowerBound/Ideal_Random/50                                -0.0032         -0.0032             3             3             3             3
BM_LowerBound/Ideal_Random/50_pvalue                          0.0071          0.0071      U Test, Repetitions: 20 vs 20
BM_LowerBound/Ideal_Random/50_mean                           -0.0046         -0.0045             3             3             3             3
BM_LowerBound/Ideal_Random/50_median                         -0.0066         -0.0066             3             3             3             3
BM_LowerBound/Ideal_Random/50_stddev                         +4.9339         +5.4636             0             0             0             0
BM_LowerBound/Ideal_Random/50_cv                             +4.9612         +5.4928             0             0             0             0
BM_LowerBound/Ideal_Random/100                               -0.1096         -0.1096             4             4             4             4
BM_LowerBound/Ideal_Random/100                               -0.1310         -0.1310             4             3             4             3
BM_LowerBound/Ideal_Random/100                               -0.1237         -0.1237             4             3             4             3
BM_LowerBound/Ideal_Random/100_pvalue                         0.0000          0.0000      U Test, Repetitions: 20 vs 20
BM_LowerBound/Ideal_Random/100_mean                          -0.1241         -0.1241             4             3             4             3
BM_LowerBound/Ideal_Random/100_median                        -0.1240         -0.1240             4             3             4             3
BM_LowerBound/Ideal_Random/100_stddev                        +3.8236         +3.8372             0             0             0             0
BM_LowerBound/Ideal_Random/100_cv                            +4.5073         +4.5227             0             0             0             0
OVERALL_GEOMEAN                                              -0.2366         -0.2366             0             0             0             0

I'll play around with benchmarksfiltered to see if I can make it do sensible things when both regexes are equal.

matta · Sep 09 '22 18:09

Hmm...am I missing something? I can't figure out how to re-open the issue myself.

matta · Sep 09 '22 18:09

I see, so the feedback is that those two "filter" notions are conflated even though they do different things.

LebedevRI · Sep 09 '22 18:09

Yes, I think so. The benchmarksfiltered filters are intended to collapse matches into a single "family", whereas --benchmark_filter= merely filters the benchmarks and does not change their names.

I thought about changing benchmarksfiltered to go into a different "filter but do not collapse" mode when the regexes used for both files are equal. But perhaps there are useful use cases for this collapsing behavior?
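To illustrate the distinction, here is a rough sketch. It assumes the collapsing amounts to replacing the matched portion of each name with a common family label (which is what the [regex vs. regex] names in the output above suggest); it is not meant to mirror compare.py's actual code:

import re

names = ["BM_LowerBound/Ideal_Random/10", "BM_LowerBound/Ideal_Random/50"]
pattern = re.compile(r"/[0-9]{1,3}$")

# --benchmark_filter behavior: drop non-matching benchmarks, keep names intact.
filtered = [n for n in names if pattern.search(n)]
# ['BM_LowerBound/Ideal_Random/10', 'BM_LowerBound/Ideal_Random/50']

# benchmarksfiltered-style collapsing: the matched portion is folded into one
# family label, so differently parameterized runs become indistinguishable.
collapsed = [pattern.sub("[<filter> vs. <filter>]", n) for n in names]
# ['BM_LowerBound/Ideal_Random[<filter> vs. <filter>]',
#  'BM_LowerBound/Ideal_Random[<filter> vs. <filter>]']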

I will re-send my original PR.

matta · Sep 09 '22 19:09

One nice thing about supporting --benchmark_filter is that you can make this transformation with no change in behavior:

compare.py benchmarks orig.json candidate-binary --benchmark_filter $REGEX

to

compare.py benchmarks orig.json candidate.json --benchmark_filter $REGEX

matta · Sep 09 '22 19:09