
feature request: Per-benchmark thresholds

Open ollie-etl opened this issue 1 year ago • 3 comments

Currently, thresholds are specified per testbed, measure, and branch. I'm requesting that this also be extended to include a (maybe optional) benchmark dimension.

If I generate a report that looks like this:

{   "benchmark 1": {
            "foo": 17.0,
            "bar": 42.0
     },
     "benchmark 2": {
            "foo": 450000.0,
            "bar": 10000000.0
     }
}

Let's say I'd like to be able to determine that "benchmark 1"."foo" doesn't exceed 20, and that "benchmark 2"."foo" doesn't exceed 50000. I currently can't, or at least it would require normalising the results. If the units have physical significance, this isn't desirable.

ollie-etl · Oct 18 '24 17:10

@ollie-etl you are correct, a Threshold is currently tied to the combination of Branch, Testbed, and Measure and then it is applied to all Benchmarks within that set.

> Let's say I'd like to be able to determine that "benchmark 1"."foo" doesn't exceed 20, and "benchmark 2"."foo" doesn't exceed 50000.

This should currently be possible using a Percentage Test (percentage) with an Upper Boundary set to 0.0. Likewise, you can also guarantee it doesn't go below that set value by setting a Lower Boundary of 0.0.

You will also want to set the Max Sample Size to 2 so that it is only ever comparing against the most recent historical result.
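For concreteness, the Model settings described above might look roughly like this (a sketch only; the field names follow my reading of the Bencher API and may not match exactly):

{
  "test": "percentage",
  "lower_boundary": 0.0,
  "upper_boundary": 0.0,
  "max_sample_size": 2
}

With the Max Sample Size at 2, each run is compared only against the most recent historical result, so the 0.0 percentage boundaries effectively act as a "no change allowed" check in both directions.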

Better filtering for which Benchmarks a Threshold applies to is something that I've been putting some thought towards: https://github.com/bencherdev/bencher/issues/366. However, I think the above solution should be sufficient to handle your example use case.

epompeii · Oct 19 '24 01:10

One issue I have is that different benchmarks also have differing amounts of variance. It seems that under the current model I would need to define one Measure per benchmark to fine-tune the threshold to the variance observed on that benchmark. (Currently, only the percentage and static models seem to fit the use case of checking for performance regressions before merging a feature branch.)
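To spell that workaround out using the report format from the original example, each benchmark would need its own Measure name, so that each Branch/Testbed/Measure combination can carry its own Threshold tuned to that benchmark's variance:

{
  "benchmark 1": {
    "benchmark 1 foo": 17.0
  },
  "benchmark 2": {
    "benchmark 2 foo": 450000.0
  }
}

This works, but it multiplies the number of Measures and loses the ability to compare the same Measure across benchmarks.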

jschwe · Jul 21 '25 03:07

@jschwe how would you like for this to work?

One possibility is that Thresholds could be updated to have zero to many Models. These Models could then be tied to regex-like patterns to determine whether they ought to be applied to a particular Benchmark. The default would be something like * or .*.
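As a purely hypothetical sketch of that shape (none of these fields exist today; the names are made up for illustration), a Threshold with pattern-scoped Models might look something like:

{
  "branch": "main",
  "testbed": "localhost",
  "measure": "foo",
  "models": [
    { "benchmark_pattern": "^benchmark 1$", "test": "static", "upper_boundary": 20.0 },
    { "benchmark_pattern": "^benchmark 2$", "test": "static", "upper_boundary": 50000.0 },
    { "benchmark_pattern": ".*", "test": "percentage", "upper_boundary": 0.1 }
  ]
}

Here the first two Models pin the static limits from the original example, and the catch-all .* Model covers everything else.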

There could also be an exclusion pattern. I'm not sure whether this would be preferable to a static list, as discussed in the previous comment.

There is also the open question of whether all matching patterns should have their Model tested or just the first to match.

https://github.com/bencherdev/bencher/issues/366#issuecomment-2423417150

epompeii · Jul 23 '25 02:07