MinIterationTime warning is raised when using the `IterationTime: 250ms` setting.
I'm trying to use the dotnet/performance repository's RecommendedConfig settings to reduce benchmark times.
But when using the IterationTime: 250ms setting, the following warning is randomly raised for some benchmarks:
// * Warnings *
MinIterationTime
DummyBenchmark.RunDummyBenchmark: Toolchain=.NET 9.0, IterationTime=250ms, LaunchCount=1, MaxIterationCount=20, MinIterationCount=15, RunStrategy=Throughput, WarmupCount=25 -> The minimum observed iteration time is 74.728ms which is very small. It's recommended to increase it to at least 100ms using more operations.
This warning is raised because the iteration time is expected to be around 250ms, but the actual execution time is 74.728ms, which is below MinIterationTimeAnalyser's threshold (100ms).
Problem Details
The IterationTime setting is used by the WorkloadPilot phase to estimate PerfectInvocationCount.
But judging from the benchmark logs, JIT warm-up during the pilot is not sufficient, so PerfectInvocationCount is underestimated:
80 ops were expected to take around 250ms, but took only ~75ms on the actual run.
- It occurs with the .NET 9 toolchain, but not with the .NET 8 toolchain.
- Increasing the warmup count does not help, because warmup is executed after the WorkloadPilot phase.
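For reference, the numbers in the log below work out like this (a simplified sketch of the pilot arithmetic, not BenchmarkDotNet's exact implementation):

```csharp
// Simplified sketch of the WorkloadPilot estimate (not BenchmarkDotNet's exact code).
// During the pilot, the method still runs tier-0 code at ~3.84 ms/op:
double pilotPerOpMs = 3.8431;
double targetIterationMs = 250;

// Invocation count needed to fill 250 ms at the observed (tier-0) speed:
double estimated = targetIterationMs / pilotPerOpMs; // ~65 ops, rounded up to 80

// After tiered JIT promotes the method it runs at ~0.93 ms/op,
// so the chosen 80 invocations only take ~75 ms, well below the 100 ms threshold:
double actualIterationMs = 80 * 0.9341;
```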
Benchmark Logs
IterationTime=250ms, LaunchCount=1, MaxIterationCount=20, MinIterationCount=15, RunStrategy=Throughput, WarmupCount=25
OverheadJitting 1: 1 op, 110200.00 ns, 110.2000 us/op
WorkloadJitting 1: 1 op, 5121600.00 ns, 5.1216 ms/op
OverheadJitting 2: 16 op, 204700.00 ns, 12.7938 us/op
WorkloadJitting 2: 16 op, 61949900.00 ns, 3.8719 ms/op
WorkloadPilot 1: 16 op, 61489600.00 ns, 3.8431 ms/op
WorkloadPilot 2: 80 op, 303822700.00 ns, 3.7978 ms/op
OverheadWarmup 1: 80 op, 2300.00 ns, 28.7500 ns/op
OverheadWarmup 2: 80 op, 600.00 ns, 7.5000 ns/op
OverheadWarmup 3: 80 op, 1000.00 ns, 12.5000 ns/op
OverheadWarmup 4: 80 op, 600.00 ns, 7.5000 ns/op
OverheadWarmup 5: 80 op, 500.00 ns, 6.2500 ns/op
OverheadWarmup 6: 80 op, 700.00 ns, 8.7500 ns/op
OverheadWarmup 7: 80 op, 300.00 ns, 3.7500 ns/op
OverheadActual 1: 80 op, 800.00 ns, 10.0000 ns/op
...
OverheadActual 20: 80 op, 800.00 ns, 10.0000 ns/op
WorkloadWarmup 1: 80 op, 268831700.00 ns, 3.3604 ms/op
WorkloadWarmup 2: 80 op, 74450400.00 ns, 930.6300 us/op
WorkloadWarmup 3: 80 op, 73728100.00 ns, 921.6013 us/op
WorkloadWarmup 4: 80 op, 76103400.00 ns, 951.2925 us/op
WorkloadWarmup 5: 80 op, 76514800.00 ns, 956.4350 us/op
WorkloadWarmup 6: 80 op, 75428500.00 ns, 942.8563 us/op
WorkloadWarmup 7: 80 op, 74758200.00 ns, 934.4775 us/op
WorkloadWarmup 8: 80 op, 75704000.00 ns, 946.3000 us/op
WorkloadWarmup 9: 80 op, 76330000.00 ns, 954.1250 us/op
WorkloadWarmup 10: 80 op, 75625400.00 ns, 945.3175 us/op
WorkloadWarmup 11: 80 op, 74481300.00 ns, 931.0163 us/op
WorkloadWarmup 12: 80 op, 75432200.00 ns, 942.9025 us/op
WorkloadWarmup 13: 80 op, 75695300.00 ns, 946.1913 us/op
WorkloadWarmup 14: 80 op, 75814400.00 ns, 947.6800 us/op
WorkloadWarmup 15: 80 op, 75705000.00 ns, 946.3125 us/op
WorkloadWarmup 16: 80 op, 74354000.00 ns, 929.4250 us/op
WorkloadWarmup 17: 80 op, 75110700.00 ns, 938.8838 us/op
WorkloadWarmup 18: 80 op, 74633700.00 ns, 932.9213 us/op
WorkloadWarmup 19: 80 op, 76451700.00 ns, 955.6463 us/op
WorkloadWarmup 20: 80 op, 75279300.00 ns, 940.9913 us/op
WorkloadWarmup 21: 80 op, 75362900.00 ns, 942.0363 us/op
WorkloadWarmup 22: 80 op, 76845000.00 ns, 960.5625 us/op
WorkloadWarmup 23: 80 op, 73960300.00 ns, 924.5038 us/op
WorkloadWarmup 24: 80 op, 75225200.00 ns, 940.3150 us/op
WorkloadWarmup 25: 80 op, 76821200.00 ns, 960.2650 us/op
// BeforeActualRun
WorkloadActual 1: 80 op, 75935300.00 ns, 949.1913 us/op
WorkloadActual 2: 80 op, 75302800.00 ns, 941.2850 us/op
WorkloadActual 3: 80 op, 75230900.00 ns, 940.3863 us/op
WorkloadActual 4: 80 op, 77478000.00 ns, 968.4750 us/op
WorkloadActual 5: 80 op, 75984200.00 ns, 949.8025 us/op
WorkloadActual 6: 80 op, 74728400.00 ns, 934.1050 us/op
WorkloadActual 7: 80 op, 76113100.00 ns, 951.4138 us/op
WorkloadActual 8: 80 op, 75724500.00 ns, 946.5563 us/op
WorkloadActual 9: 80 op, 76461300.00 ns, 955.7663 us/op
WorkloadActual 10: 80 op, 77062700.00 ns, 963.2838 us/op
WorkloadActual 11: 80 op, 75235800.00 ns, 940.4475 us/op
WorkloadActual 12: 80 op, 76656800.00 ns, 958.2100 us/op
WorkloadActual 13: 80 op, 75006700.00 ns, 937.5838 us/op
WorkloadActual 14: 80 op, 75713500.00 ns, 946.4188 us/op
WorkloadActual 15: 80 op, 76725100.00 ns, 959.0638 us/op
Benchmark Code
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Toolchains.CsProj;
using Perfolizer.Horology;

static void Main(string[] args)
{
    var config = ManualConfig.Create(DefaultConfig.Instance);
    config.AddDiagnoser(DefaultConfig.Instance.GetDiagnosers().ToArray());
    config.AddJob(
        Job.Default
            .WithLaunchCount(1)
            .WithWarmupCount(25)
            .WithIterationTime(TimeInterval.FromMilliseconds(250)) // Default: 500ms
            .WithMinIterationCount(15) // Default: 15
            .WithMaxIterationCount(20) // Default: 100
            .WithToolchain(CsProjCoreToolchain.NetCoreApp90)
            //.WithEnvironmentVariable("DOTNET_TC_AggressiveTiering", "1")
            .WithStrategy(RunStrategy.Throughput));

    BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config: config);
}

public class DummyBenchmark
{
    private static readonly Consumer consumer = new Consumer();

    [Benchmark]
    public void RunDummyBenchmark()
    {
        foreach (var i in Enumerable.Range(1, 1_000_000))
            consumer.Consume(i);
    }
}
See also #1210
A simple workaround is to ask the Pilot stage to complete several iterations with the appropriate iteration time. However, this may not work reliably since we cannot control how tiered JIT behaves.
A more reliable solution could involve merging the Pilot stage with the Warmup stage. This would allow the Warmup stage to calibrate the optimum invocation count on the fly. With tiered JIT, this approach seems more reasonable, as we will not be able to accurately detect the proper invocation count before the warmup is finished.
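A merged stage could look roughly like this (a hypothetical sketch; `Measure` and `IsStable` are illustrative placeholders, not BenchmarkDotNet APIs):

```csharp
// Hypothetical sketch of a merged Pilot+Warmup stage: keep re-estimating the
// invocation count while warming up, and only stop once tiering has settled.
long invokeCount = 1;
int warmupsDone = 0;

while (true)
{
    double perOpNs = Measure(invokeCount); // run the workload invokeCount times
    long newCount = (long)Math.Ceiling(targetIterationNs / perOpNs);
    warmupsDone++;

    // The per-op time stops shrinking once the JIT has promoted the method,
    // so a stable estimate means the invocation count is finally trustworthy.
    if (warmupsDone >= warmupCount && IsStable(invokeCount, newCount))
        break;

    invokeCount = newCount;
}
```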
@timcassell what do you think?
Merging pilot and warmup into a single stage sounds like a good idea to me.
In addition to that (or separately), I actually think we should adjust the JIT stage to invoke 30 + 30 times (with 200ms wait after each) to forcibly tier it up before we even start the pilot/warmup stages for RunStrategy.Throughput jobs (yes I know these numbers are undocumented and subject to change).
Also, now that tiered JIT is mature, we should also ask the runtime to expose a way for us to configure aggressive tiering to reduce the invoke count + wait time (previously discussed in https://github.com/dotnet/runtime/issues/13069, but it was closed and locked earlier this year).
[Edit] It looks like we already can configure aggressive tiering with DOTNET_TC_AggressiveTiering environment variable, according to https://medium.com/@meriffa/net-core-concepts-tiered-compilation-10f7da3a29c7.
the JIT stage to invoke 30 + 30 times
There is an issue with long-running benchmarks that require a single invocation. If one invocation takes 10 minutes, we likely don't want to spend 60 iterations just on warmup. When we conduct a small number of pilot or warmup iterations for such a benchmark, we may still encounter tiered JIT effects. However, it appears we cannot resolve this without configuring the runtime.
It looks like if we set DOTNET_TC_AggressiveTiering=1, we will only need to invoke 2 times (tier0 + tier0-DPGO) to get it up to tier 1. We are already invoking 2 times in JIT stage now, so it should just be a matter of pausing between the invocations.
In which .NET version was DOTNET_TC_AggressiveTiering introduced? I remember that when I first explored the tiered JIT, this option was not available. Therefore, this approach may not function properly with some older .NET runtimes.
Additionally, as I understand it, the .NET Framework does not support tiered JIT. Consequently, all tiered JIT-related engine customizations should be implemented on a per-runtime basis.
In which .NET version was DOTNET_TC_AggressiveTiering introduced? I remember that when I first explored the tiered JIT, this option was not available. Therefore, this approach may not function properly with some older .NET runtimes.
It seems it was added in .Net 8. However, we can also set COMPlus_TC_CallCountThreshold=1 to do the same thing in .Net Core 3.0 or newer.
In .Net 6 or newer, we can set DOTNET_TC_CallCountingDelayMs=0 to reduce the time we need to wait between calls, but that's less important than the call count.
.Net Core 2.X doesn't have any configuration options for call count, but tiered JIT is disabled by default in that runtime anyway.
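If we go the environment-variable route, a per-job configuration could look roughly like this (a sketch using the knobs named above; their availability per runtime version is as discussed, not verified against every release):

```csharp
// Sketch: force early tier-1 compilation via tiering knobs on a job.
// .NET Core 3.0+ accepts the COMPlus_ prefix; .NET 6+ also accepts DOTNET_.
var job = Job.Default
    // Promote methods to tier 1 after a single call:
    .WithEnvironmentVariable("COMPlus_TC_CallCountThreshold", "1")
    // .NET 6+: remove the background call-counting delay:
    .WithEnvironmentVariable("DOTNET_TC_CallCountingDelayMs", "0");
```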
Additionally, as I understand it, the .NET Framework does not support tiered JIT. Consequently, all tiered JIT-related engine customizations should be implemented on a per-runtime basis.
Correct, and it's not an issue. I already added a flag to check if tiered JIT is enabled in #2562, we can add another flag for DPGO to get even more fine-grained.
Also, we can't configure the JIT in any way for in-process benchmarks, so updating the pilot/warmup stage will be needed for those toolchains anyway.
Actually, it looks like TC_AggressiveTiering was added in .Net 5 (so probably prefixed with COMPlus_ at the time) in https://github.com/dotnet/runtime/pull/1457 (it was also mentioned in https://github.com/dotnet/BenchmarkDotNet/issues/1993#issuecomment-1117419356). It was very much undocumented back then, but it seems to have stuck around.