BenchmarkDotNet
Incorrectly detected number of invocations
Currently, we have the following heuristic: if the duration of the first WorkloadJitting iteration is more than IterationTime (which is 500ms by default), we skip the Pilot stage and perform a single invocation in further iterations.
This heuristic doesn't always work well, so within the scope of #1573 we introduced another one: if the first WorkloadJitting iteration takes more than IterationTime but less than 1 second (a magic number), we discard the result and perform another WorkloadJitting iteration.
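To make the two heuristics easier to follow, here is a rough Python sketch of the decision logic as described above. The function and return values are purely illustrative (this is not BenchmarkDotNet's actual code); the 500 ms and 1 s thresholds are the ones named in the text.

```python
ITERATION_TIME = 0.5   # seconds; the default IterationTime from the text
MAGIC_NUMBER = 1.0     # seconds; the magic number introduced in #1573

def plan_invocations(first_jitting_duration, rerun_jitting):
    """Illustrative decision after the first WorkloadJitting iteration.

    first_jitting_duration: duration of the first jitting iteration, in seconds.
    rerun_jitting: callable performing another jitting iteration,
                   returning its duration in seconds.
    """
    if first_jitting_duration > ITERATION_TIME:
        if first_jitting_duration < MAGIC_NUMBER:
            # Second heuristic: suspiciously slow but under the magic number,
            # so discard the first result and measure once more.
            first_jitting_duration = rerun_jitting()
        if first_jitting_duration > ITERATION_TIME:
            # First heuristic: benchmark looks heavy, so skip the Pilot stage
            # and perform a single invocation in further iterations.
            return "skip_pilot_single_invocation"
    return "run_pilot_stage"
```

With these thresholds, a 2.66 s first run (as in the log below) is above the magic number, gets no second chance, and skips the Pilot stage outright.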
Unfortunately, it doesn't always work well either. If we try to run these F# benchmarks (the problem was originally reported here by Matthew Crews), we could observe the following situation (the full log is available here):
WorkloadJitting 1: 1 op, 2660880000.00 ns, 2.6609 s/op
WorkloadWarmup 1: 1 op, 55999700.00 ns, 55.9997 ms/op
WorkloadWarmup 2: 1 op, 48296500.00 ns, 48.2965 ms/op
WorkloadWarmup 3: 1 op, 47339800.00 ns, 47.3398 ms/op
WorkloadWarmup 4: 1 op, 37181000.00 ns, 37.1810 ms/op
WorkloadWarmup 5: 1 op, 37303500.00 ns, 37.3035 ms/op
WorkloadWarmup 6: 1 op, 38082300.00 ns, 38.0823 ms/op
WorkloadWarmup 7: 1 op, 37534700.00 ns, 37.5347 ms/op
WorkloadWarmup 8: 1 op, 39494000.00 ns, 39.4940 ms/op
WorkloadWarmup 9: 1 op, 37381700.00 ns, 37.3817 ms/op
While the typical duration of the warmed-up benchmark is about 35-45 ms, the first invocation takes more than 2.5 seconds. This situation leads to reasonable warnings:
I believe that the ultimate solution is to introduce the ability to restart the Pilot stage in the case of "too fast" WorkloadWarmup iterations. However, that would require a major refactoring of our Engine infrastructure. As a quick hotfix, we could just bump the magic number from 1 second to a higher value (say, 10 seconds). The worst-case side effect of such a change is an extra 10 seconds of total benchmarking time for heavy benchmarks (those that actually take more than 500 ms but less than 10 seconds).
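A quick worked example (sketch, not BDN code) of the proposed hotfix: the second-chance check is just a range test, so bumping the upper bound from 1 s to 10 s makes the 2.66 s first run from the log above eligible for a rerun, and the ~40 ms rerun would then send the benchmark through the Pilot stage.

```python
def gets_second_chance(first_run_s, iteration_time_s=0.5, magic_number_s=10.0):
    # Illustrative: the first jitting result is discarded and re-measured
    # only when it falls between IterationTime and the magic number.
    return iteration_time_s < first_run_s < magic_number_s

# Old magic number (1 s): the 2.66 s run is NOT rerun, so Pilot is skipped.
# Proposed magic number (10 s): the 2.66 s run IS rerun.
```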
@adamsitnik, what do you think?
Is this going to be released in a new version soon? Don't mean to pressure 😊, just curious since I would like to benchmark some code that would run into this issue.
I have encountered this phenomenon several times, mainly on code with static initialization, and it's pretty annoying.
See notes on how I ran into this when evaluating OSR: https://github.com/dotnet/runtime/issues/33658#issuecomment-1068293937 (and following).
Without OSR the initial workload run takes > 250 ms, so it gets a second chance (in dotnet/perf the goal time is set to 250ms); the second number is considerably lower so BDN runs pilot stages and settles on ~3K operations per workload interval.
With OSR the initial run drops to 220 ms; this isn't slow enough to trigger the second chance, but is slow enough to convince BDN not to do pilot stages.
The chart below shows the number of operations per workload interval over time for this test; you can see it varies greatly...

There is a second part to this heuristic. Say the goal is to run for 250 ms and the initial run comes in at 220 ms, as above. We're not slow enough for the second chance, but then we fail the 250/220 < 1.5 test right after this, so we're too long to do a pilot stage.
In tests like the above, if the initial jit time crosses back and forth around 250 ms across BDN invocations over time, then the number of iterations fluctuates, and this leads to wildly fluctuating perf numbers too.
For example (sorry the x-axes don't quite line up here, but you get the idea):
(upper is reported perf, lower is number of operations, for a benchmark over time)

If nothing else, this second heuristic should have the same threshold as the first; as it is, we have an odd band between roughly 167 ms (~250/1.5) and 250 ms where running either faster or slower than this can greatly alter the iteration plan.
This also impacts people like me who may use a checked build of the jit or enable some other jit diagnostic output. Because this makes the jit run more slowly, it can end up boosting the initial iteration time out of this (167, 250) zone to some value above 250 ms, and magically the benchmark seems to run faster.
Could be related: https://github.com/dotnet/perf-autofiling-issues/issues/3990 (System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Mariomkas)