
MinIterationTime warning occurs when using the `IterationTime:250ms` setting

filzrev opened this issue 6 months ago • 9 comments

I'm trying to use the dotnet/performance repository's RecommendedConfig settings to reduce benchmark times.

But when using the `IterationTime:250ms` setting, the following warning is randomly raised for some benchmarks.

// * Warnings *
MinIterationTime
  DummyBenchmark.RunDummyBenchmark: Toolchain=.NET 9.0, IterationTime=250ms, LaunchCount=1, MaxIterationCount=20, MinIterationCount=15, RunStrategy=Throughput, WarmupCount=25 -> The minimum observed iteration time is 74.728ms which is very small. It's recommended to increase it to at least 100ms using more operations.

This warning is raised because each iteration is expected to take around 250ms, but the actual execution time is 74.728ms, which is below MinIterationTimeAnalyser's threshold (100ms).

Problem Details

The IterationTime setting is used by the WorkloadPilot phase to estimate PerfectInvocationCount. But judging from the benchmark logs, the JIT has not finished tiering up during the pilot phase, so PerfectInvocationCount is underestimated.
(80 ops were expected to take around 250ms, but took only ~75ms on the actual run.)

  • It occurs with the .NET 9 toolchain, but not with the .NET 8 toolchain.
  • Increasing the warmup count does not help, because warmup is executed after the WorkloadPilot phase.
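The mismatch can be checked from the log numbers alone. Below is a back-of-the-envelope sketch (the per-op times are read off the logs further down; the choice of 80 ops is the value the engine actually picked in the log, not a formula verified against BenchmarkDotNet internals):

```csharp
using System;

class PilotMath
{
    static void Main()
    {
        double iterationTimeMs = 250;  // requested IterationTime
        double pilotPerOpMs    = 3.8;  // per-op time observed during WorkloadPilot (tier-0 code)
        double tieredPerOpMs   = 0.94; // per-op time after tiered JIT promotes the method

        // Pilot estimate: how many ops fit into 250ms at tier-0 speed (≈66; the engine chose 80).
        double estimatedOps = iterationTimeMs / pilotPerOpMs;
        int chosenOps = 80; // value seen in the log

        // Once the method reaches tier 1, the same 80 ops take far less than 250ms,
        // landing below the 100ms MinIterationTime threshold.
        double actualIterationMs = chosenOps * tieredPerOpMs;

        Console.WriteLine($"estimated ops: {estimatedOps:F0}, actual iteration: {actualIterationMs:F1} ms");
    }
}
```

So the pilot calibrates against ~3.8ms/op tier-0 code, while the measured iterations run ~0.94ms/op tier-1 code, giving iterations of roughly 75ms instead of 250ms.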

Benchmark Logs

IterationTime=250ms, LaunchCount=1, MaxIterationCount=20, MinIterationCount=15, RunStrategy=Throughput, WarmupCount=25

OverheadJitting  1: 1 op, 110200.00 ns, 110.2000 us/op
WorkloadJitting  1: 1 op, 5121600.00 ns, 5.1216 ms/op

OverheadJitting  2: 16 op, 204700.00 ns, 12.7938 us/op
WorkloadJitting  2: 16 op, 61949900.00 ns, 3.8719 ms/op

WorkloadPilot    1: 16 op, 61489600.00 ns, 3.8431 ms/op
WorkloadPilot    2: 80 op, 303822700.00 ns, 3.7978 ms/op

OverheadWarmup   1: 80 op, 2300.00 ns, 28.7500 ns/op
OverheadWarmup   2: 80 op, 600.00 ns, 7.5000 ns/op
OverheadWarmup   3: 80 op, 1000.00 ns, 12.5000 ns/op
OverheadWarmup   4: 80 op, 600.00 ns, 7.5000 ns/op
OverheadWarmup   5: 80 op, 500.00 ns, 6.2500 ns/op
OverheadWarmup   6: 80 op, 700.00 ns, 8.7500 ns/op
OverheadWarmup   7: 80 op, 300.00 ns, 3.7500 ns/op

OverheadActual   1: 80 op, 800.00 ns, 10.0000 ns/op
...
OverheadActual  20: 80 op, 800.00 ns, 10.0000 ns/op

WorkloadWarmup   1: 80 op, 268831700.00 ns, 3.3604 ms/op
WorkloadWarmup   2: 80 op, 74450400.00 ns, 930.6300 us/op
WorkloadWarmup   3: 80 op, 73728100.00 ns, 921.6013 us/op
WorkloadWarmup   4: 80 op, 76103400.00 ns, 951.2925 us/op
WorkloadWarmup   5: 80 op, 76514800.00 ns, 956.4350 us/op
WorkloadWarmup   6: 80 op, 75428500.00 ns, 942.8563 us/op
WorkloadWarmup   7: 80 op, 74758200.00 ns, 934.4775 us/op
WorkloadWarmup   8: 80 op, 75704000.00 ns, 946.3000 us/op
WorkloadWarmup   9: 80 op, 76330000.00 ns, 954.1250 us/op
WorkloadWarmup  10: 80 op, 75625400.00 ns, 945.3175 us/op
WorkloadWarmup  11: 80 op, 74481300.00 ns, 931.0163 us/op
WorkloadWarmup  12: 80 op, 75432200.00 ns, 942.9025 us/op
WorkloadWarmup  13: 80 op, 75695300.00 ns, 946.1913 us/op
WorkloadWarmup  14: 80 op, 75814400.00 ns, 947.6800 us/op
WorkloadWarmup  15: 80 op, 75705000.00 ns, 946.3125 us/op
WorkloadWarmup  16: 80 op, 74354000.00 ns, 929.4250 us/op
WorkloadWarmup  17: 80 op, 75110700.00 ns, 938.8838 us/op
WorkloadWarmup  18: 80 op, 74633700.00 ns, 932.9213 us/op
WorkloadWarmup  19: 80 op, 76451700.00 ns, 955.6463 us/op
WorkloadWarmup  20: 80 op, 75279300.00 ns, 940.9913 us/op
WorkloadWarmup  21: 80 op, 75362900.00 ns, 942.0363 us/op
WorkloadWarmup  22: 80 op, 76845000.00 ns, 960.5625 us/op
WorkloadWarmup  23: 80 op, 73960300.00 ns, 924.5038 us/op
WorkloadWarmup  24: 80 op, 75225200.00 ns, 940.3150 us/op
WorkloadWarmup  25: 80 op, 76821200.00 ns, 960.2650 us/op

// BeforeActualRun
WorkloadActual   1: 80 op, 75935300.00 ns, 949.1913 us/op
WorkloadActual   2: 80 op, 75302800.00 ns, 941.2850 us/op
WorkloadActual   3: 80 op, 75230900.00 ns, 940.3863 us/op
WorkloadActual   4: 80 op, 77478000.00 ns, 968.4750 us/op
WorkloadActual   5: 80 op, 75984200.00 ns, 949.8025 us/op
WorkloadActual   6: 80 op, 74728400.00 ns, 934.1050 us/op
WorkloadActual   7: 80 op, 76113100.00 ns, 951.4138 us/op
WorkloadActual   8: 80 op, 75724500.00 ns, 946.5563 us/op
WorkloadActual   9: 80 op, 76461300.00 ns, 955.7663 us/op
WorkloadActual  10: 80 op, 77062700.00 ns, 963.2838 us/op
WorkloadActual  11: 80 op, 75235800.00 ns, 940.4475 us/op
WorkloadActual  12: 80 op, 76656800.00 ns, 958.2100 us/op
WorkloadActual  13: 80 op, 75006700.00 ns, 937.5838 us/op
WorkloadActual  14: 80 op, 75713500.00 ns, 946.4188 us/op
WorkloadActual  15: 80 op, 76725100.00 ns, 959.0638 us/op

Benchmark Code

        using System;
        using System.Linq;
        using BenchmarkDotNet.Attributes;
        using BenchmarkDotNet.Configs;
        using BenchmarkDotNet.Engines;
        using BenchmarkDotNet.Jobs;
        using BenchmarkDotNet.Running;
        using BenchmarkDotNet.Toolchains.CsProj;
        using Perfolizer.Horology;

        public class Program
        {
            static void Main(string[] args)
            {
                var config = ManualConfig.Create(DefaultConfig.Instance);
                config.AddDiagnoser(DefaultConfig.Instance.GetDiagnosers().ToArray());
                config.AddJob(
                    Job.Default
                       .WithLaunchCount(1)
                       .WithWarmupCount(25)
                       .WithIterationTime(TimeInterval.FromMilliseconds(250)) // Default: 500ms
                       .WithMinIterationCount(15) // Default: 15
                       .WithMaxIterationCount(20) // Default: 100
                       .WithToolchain(CsProjCoreToolchain.NetCoreApp90)
                       //.WithEnvironmentVariable("DOTNET_TC_AggressiveTiering", "1")
                       .WithStrategy(RunStrategy.Throughput));

                BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config: config);
            }
        }

        public class DummyBenchmark
        {
            private static readonly Consumer consumer = new Consumer();

            [Benchmark]
            public void RunDummyBenchmark()
            {
                foreach (var i in Enumerable.Range(1, 1_000_000))
                    consumer.Consume(i);
            }
        }

filzrev · Jun 18 '25 07:06

See also #1210

timcassell · Jun 18 '25 09:06

A simple workaround is to ask the Pilot stage to complete several iterations with the appropriate iteration time. However, this may not work reliably since we cannot control how tiered JIT behaves.

A more reliable solution could involve merging the Pilot stage with the Warmup stage. This would allow the Warmup stage to calibrate the optimum invocation count on the fly. With tiered JIT, this approach seems more reasonable, as we will not be able to accurately detect the proper invocation count before the warmup is finished.
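The on-the-fly calibration idea can be sketched roughly as follows (an illustrative standalone loop, not BenchmarkDotNet engine code; the dummy workload and iteration count are arbitrary):

```csharp
using System;
using System.Diagnostics;

class MergedPilotWarmup
{
    static void Main()
    {
        // Stand-in for the benchmark delegate.
        Action workload = () => { double x = 0; for (int i = 0; i < 100_000; i++) x += i; };

        double targetMs = 250; // desired IterationTime
        long invokeCount = 1;

        for (int iteration = 1; iteration <= 5; iteration++)
        {
            var sw = Stopwatch.StartNew();
            for (long i = 0; i < invokeCount; i++) workload();
            sw.Stop();

            double perOpMs = sw.Elapsed.TotalMilliseconds / invokeCount;
            // Recalibrate after every warmup iteration: as tier-1 code replaces
            // tier-0 code, perOpMs drops and invokeCount grows toward the optimum.
            invokeCount = Math.Max(1, (long)(targetMs / perOpMs));
        }

        Console.WriteLine($"calibrated invocation count: {invokeCount}");
    }
}
```

The key difference from the current design is that the invocation count keeps adjusting during warmup rather than being frozen after the pilot stage.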

@timcassell what do you think?

AndreyAkinshin · Jun 24 '25 14:06

Merging pilot and warmup into a single stage sounds like a good idea to me.

In addition to that (or separately), I actually think we should adjust the JIT stage to invoke 30 + 30 times (with 200ms wait after each) to forcibly tier it up before we even start the pilot/warmup stages for RunStrategy.Throughput jobs (yes I know these numbers are undocumented and subject to change).

Also, now that tiered JIT is mature, we should ask the runtime team to expose a way for us to configure aggressive tiering to reduce the invoke count + wait time (previously discussed in https://github.com/dotnet/runtime/issues/13069, but it was closed and locked earlier this year).

[Edit] It looks like we already can configure aggressive tiering with DOTNET_TC_AggressiveTiering environment variable, according to https://medium.com/@meriffa/net-core-concepts-tiered-compilation-10f7da3a29c7.
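For anyone hitting this today, that variable can be applied per job, mirroring the commented-out line in the repro above (a sketch; whether it fully silences the warning is untested here):

```csharp
using BenchmarkDotNet.Jobs;
using Perfolizer.Horology;

var job = Job.Default
    .WithIterationTime(TimeInterval.FromMilliseconds(250))
    // Promote methods to tier 1 after very few calls, so the pilot stage
    // measures tier-1 code instead of slower tier-0 code:
    .WithEnvironmentVariable("DOTNET_TC_AggressiveTiering", "1");
```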

timcassell · Jun 24 '25 15:06

the JIT stage to invoke 30 + 30 times

There is an issue with long-running benchmarks that require a single invocation. If one invocation takes 10 minutes, we likely don't want to spend 60 iterations just on warmup. When we conduct a small number of pilot or warmup iterations for such a benchmark, we may still encounter tiered JIT effects. However, it appears we cannot resolve this without configuring the runtime.

AndreyAkinshin · Jun 24 '25 16:06

It looks like if we set DOTNET_TC_AggressiveTiering=1, we will only need to invoke 2 times (tier0 + tier0-DPGO) to get it up to tier 1. We are already invoking 2 times in JIT stage now, so it should just be a matter of pausing between the invocations.
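The proposed JIT-stage shape would then look roughly like this (an illustrative fragment, not the actual engine code; `workload` stands for the benchmark delegate, and the 200ms pause is an assumed value):

```csharp
// Assumption: with DOTNET_TC_AggressiveTiering=1, two calls are enough to reach tier 1.
workload();                          // 1st call: tier 0 (instrumented for DPGO)
System.Threading.Thread.Sleep(200);  // assumed pause for background tier-1 compilation to finish
workload();                          // 2nd call: should now run tier-1 code
```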

timcassell · Jun 24 '25 17:06

In which .NET version was DOTNET_TC_AggressiveTiering introduced? I remember that when I first explored the tiered JIT, this option was not available. Therefore, this approach may not function properly with some older .NET runtimes.

Additionally, as I understand it, the .NET Framework does not support tiered JIT. Consequently, all tiered JIT-related engine customizations should be implemented on a per-runtime basis.

AndreyAkinshin · Jun 24 '25 17:06

In which .NET version was DOTNET_TC_AggressiveTiering introduced? I remember that when I first explored the tiered JIT, this option was not available. Therefore, this approach may not function properly with some older .NET runtimes.

It seems it was added in .Net 8. However, we can also set COMPlus_TC_CallCountThreshold=1 to do the same thing in .Net Core 3.0 or newer. In .Net 6 or newer, we can set DOTNET_TC_CallCountingDelayMs=0 to reduce the time we need to wait between calls, but that's less important than the call count. .Net Core 2.X doesn't have any configuration options for call count, but tiered JIT is disabled by default in that runtime anyway.
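Choosing the knob per runtime could look like this (a hedged sketch: `WithFastTiering` is a hypothetical helper, not a BenchmarkDotNet API, and the version gating follows the versions discussed in this thread, which are revised further down):

```csharp
using System;
using BenchmarkDotNet.Jobs;

static class TieringConfig
{
    public static Job WithFastTiering(this Job job, Version runtimeVersion)
    {
        if (runtimeVersion.Major >= 8)
            return job.WithEnvironmentVariable("DOTNET_TC_AggressiveTiering", "1");
        if (runtimeVersion.Major >= 3) // .NET Core 3.0+
            return job.WithEnvironmentVariable("COMPlus_TC_CallCountThreshold", "1");
        return job; // .NET Core 2.x: tiered JIT is disabled by default anyway
    }
}
```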

Additionally, as I understand it, the .NET Framework does not support tiered JIT. Consequently, all tiered JIT-related engine customizations should be implemented on a per-runtime basis.

Correct, and it's not an issue. I already added a flag to check if tiered JIT is enabled in #2562, we can add another flag for DPGO to get even more fine-grained.

timcassell · Jun 24 '25 18:06

Also, we can't configure the JIT in any way for in-process benchmarks, so updating the pilot/warmup stage will be needed for those toolchains anyway.

timcassell · Jun 24 '25 18:06

Actually, it looks like TC_AggressiveTiering was added in .Net 5 (so probably prefixed with COMPlus_ at the time) in https://github.com/dotnet/runtime/pull/1457 (it was also mentioned in https://github.com/dotnet/BenchmarkDotNet/issues/1993#issuecomment-1117419356). It was very much undocumented back then, but it seems to have stuck around.

timcassell · Jun 24 '25 19:06