BenchmarkDotNet
ValueTask-returning benchmarks are not being called correctly
When a benchmark returns a `ValueTask`, it is simply called with `.GetAwaiter().GetResult()`.
This is illegal `ValueTask` usage: you must not call `GetResult()` until the task has completed.
`IValueTaskSource`-backed tasks do not need to support blocking on `GetResult()`. The in-box `ManualResetValueTaskSourceCore` does not support a blocking `GetResult()`, and neither does the `DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS` feature in .NET 5.
Instead, the `ValueTask` should be converted to a `Task` first:
```csharp
static void BlockForResult(ValueTask task)
{
    if (task.IsCompleted)
        task.GetAwaiter().GetResult();
    else
        task.AsTask().GetAwaiter().GetResult();
}
```
(This will be unfortunate for the memory diagnoser, as a bogus Task will show up, but the alternative is to just crash...)
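The same pattern would presumably be needed for the generic form; a hypothetical `ValueTask<T>` variant of the same helper:

```csharp
static T BlockForResult<T>(ValueTask<T> task)
{
    // GetResult() is only legal once the ValueTask has completed.
    if (task.IsCompleted)
        return task.GetAwaiter().GetResult();

    // Otherwise, convert to a Task first; blocking on a Task is always
    // supported (at the cost of an extra Task allocation).
    return task.AsTask().GetAwaiter().GetResult();
}
```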
@scalablecory is this true for both `ValueTask` and `ValueTask<T>`, or just `ValueTask`?
@stephentoub do you have any suggestions for a solution with minimum overhead?
> is this true for both `ValueTask` and `ValueTask<T>`, or just `ValueTask`?
Both. https://devblogs.microsoft.com/dotnet/understanding-the-whys-whats-and-whens-of-valuetask/
> do you have any suggestions for a solution with minimum overhead?
The best would be if whatever invocation loop BenchmarkDotNet generates were async and the returned task, whether a `Task` or a `ValueTask`, were just awaited. There is overhead to synchronously blocking on a `Task` as well, overhead that doesn't exist when directly awaiting it.
```csharp
for (int i = 0; i < invokeCount; i++)
{
    await workloadDelegate();
}
```
where the delegate type is defined to return `Task`/`ValueTask` instead of `void`, and where there is no `.GetAwaiter().GetResult()` in each invocation; rather, it should be at the top-most level possible.
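For illustration, a top-level wrapper of that shape might look like this (a sketch only; `MeasureAsync` and its signature are hypothetical, not BDN's actual generated code):

```csharp
// Hypothetical async measurement loop: plain awaits inside, and a single
// GetAwaiter().GetResult() at the very top level, on a real Task.
private static async Task<double> MeasureAsync(
    Func<ValueTask> workloadDelegate, int invokeCount, IClock clock)
{
    StartedClock started = clock.Start();
    for (int i = 0; i < invokeCount; i++)
    {
        await workloadDelegate();
    }
    return started.GetElapsed().GetNanoseconds();
}

// Blocking once on a Task is legal, unlike blocking on each ValueTask.
double nanoseconds = MeasureAsync(workload, invokeCount, clock).GetAwaiter().GetResult();
```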
I was thinking about how to achieve this. What @stephentoub said makes sense, so I was thinking of a way to do it without having to refactor the entire engine, only the measurement. I came up with this:
**`async void` measurer:**

```csharp
public class TaskMeasurer
{
    private readonly object _locker = new object();
    private readonly Func<Task> _taskRunner;
    private IClock _clock;
    private double _elapsedNanoseconds;

    public TaskMeasurer(Func<Task> taskRunner)
    {
        _taskRunner = taskRunner;
    }

    public double Measure(IClock clock, int repeatCount)
    {
        if (Interlocked.CompareExchange(ref _clock, clock, null) != null)
        {
            throw new InvalidOperationException("Can only measure one task at a time.");
        }
        RunTasks(clock, repeatCount);
        lock (_locker)
        {
            if (double.IsNaN(_elapsedNanoseconds))
            {
                Monitor.Wait(_locker);
            }
        }
        double nanoSeconds = _elapsedNanoseconds;
        _clock = default;
        return nanoSeconds;
    }

    private async void RunTasks(IClock clock, int repeatCount)
    {
        _elapsedNanoseconds = double.NaN;
        StartedClock startedClock = clock.Start();
        for (int i = 0; i < repeatCount; ++i)
        {
            await _taskRunner();
        }
        _elapsedNanoseconds = startedClock.GetElapsed().GetNanoseconds();
        lock (_locker)
        {
            Monitor.Pulse(_locker);
        }
    }
}
```
This makes sense logically, but it has a critical issue: `async void` has overhead of its own. So I thought, why not write a custom task-like type to eliminate that overhead (or use one that already exists, like UniTask)? Then I realized that's overkill here: we can write the simple state machine ourselves, mimicking how async method builders do it, using `GetAwaiter()` directly. That evolved into this:
**`GetAwaiter` measurer:**

```csharp
public class TaskMeasurer
{
    private readonly object _locker = new object();
    private readonly Action _continuation;
    private readonly Func<Task> _taskRunner;
    private IClock _clock;
    private StartedClock _startedClock;
    private int _repeatsRemaining;
    private double _elapsedNanoseconds;
    private TaskAwaiter _currentAwaiter;

    public TaskMeasurer(Func<Task> taskRunner)
    {
        _taskRunner = taskRunner;
        _continuation = Continuation;
    }

    public double Measure(IClock clock, int repeatCount)
    {
        if (Interlocked.CompareExchange(ref _clock, clock, null) != null)
        {
            throw new InvalidOperationException("Can only measure one task at a time.");
        }
        _repeatsRemaining = repeatCount;
        _elapsedNanoseconds = double.NaN;
        _startedClock = clock.Start();
        RunTask();
        lock (_locker)
        {
            if (double.IsNaN(_elapsedNanoseconds))
            {
                Monitor.Wait(_locker);
            }
        }
        double nanoSeconds = _elapsedNanoseconds;
        _currentAwaiter = default;
        _startedClock = default;
        _clock = default;
        return nanoSeconds;
    }

    private void RunTask()
    {
    Repeat:
        --_repeatsRemaining;
        _currentAwaiter = _taskRunner().GetAwaiter();
        if (_currentAwaiter.IsCompleted)
        {
            _currentAwaiter.GetResult();
            if (_repeatsRemaining == 0)
            {
                Finish();
                return;
            }
            goto Repeat;
        }
        _currentAwaiter.OnCompleted(_continuation);
    }

    private void Continuation()
    {
        _currentAwaiter.GetResult();
        if (_repeatsRemaining == 0)
        {
            Finish();
        }
        else
        {
            RunTask();
        }
    }

    private void Finish()
    {
        _elapsedNanoseconds = _startedClock.GetElapsed().GetNanoseconds();
        lock (_locker)
        {
            Monitor.Pulse(_locker);
        }
    }
}
```
With this method, we can swap out the current measurement code to use this class, and a similar class can be used for synchronous measurements, both behind an `IMeasurer` interface or something. Admittedly, this would still require refactoring the auto-generated code. What do you guys think?
[Edit] Actually, I guess the entire engine would have to be refactored anyway due to how the `WorkloadAction` is currently set up... But this should effectively replace `WorkloadAction`.
And this should probably catch any exceptions from `GetResult()` and propagate them back out of `Measure`, which I didn't include here, but that should be easy enough.
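A minimal sketch of that exception plumbing against the `TaskMeasurer` above (the `_exception` field and the `ExceptionDispatchInfo` rethrow are my additions, not existing BDN code):

```csharp
using System.Runtime.ExceptionServices;

// New field on TaskMeasurer:
private Exception _exception;

private void Continuation()
{
    try
    {
        _currentAwaiter.GetResult();
    }
    catch (Exception e)
    {
        _exception = e;
        Finish(); // still pulse so Measure wakes up
        return;
    }
    if (_repeatsRemaining == 0)
    {
        Finish();
    }
    else
    {
        RunTask();
    }
}

// At the end of Measure, after reading _elapsedNanoseconds:
if (_exception != null)
{
    Exception exception = _exception;
    _exception = null;
    // Rethrow with the original stack trace preserved.
    ExceptionDispatchInfo.Capture(exception).Throw();
}
```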
With the interest in supporting Unity (#1860) and potentially its async operations, which usually require execution over multiple frames, would it make sense to refactor the entire engine to be async and have the process entry point synchronously block on the async engine (`Task.Wait()`), so that Unity could wait on the async engine its own way? Or does my suggestion above make more sense for now? (Engine stays synchronous; `IMeasurer` blocks internally.)
[Edit] Trying to measure Unity's async operations with a synchronous engine would result in a deadlock, which is why I ask.
@timcassell When running sync benchmarks, the engine should not be doing any async work (it could affect the results by introducing noise). We should most probably have two engines (sync and async) or extend the existing engine with async support.
@adamsitnik The way I was imagining an async engine working with sync benchmarks is, still using an `IMeasurer` interface, `IMeasurer.Measure(IClock)` returns a `ValueTask<double>` that is already completed with the measured time. So the engine can be fully async and work with both sync and async benchmarks without overhead. The actual measurement would just be shifted to the measurer instead of the engine, but the engine still orchestrates everything (`GlobalSetup` and all).
And async measurers can re-use a cached, resettable `IValueTaskSource` so as not to interfere with the memory diagnoser.
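For what I mean by a cached, resettable source, something like this could work (a sketch built on the in-box `ManualResetValueTaskSourceCore<T>`; the type and member names are made up):

```csharp
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// A reusable completion source: each measurement resets the core and hands out
// a new ValueTask over the same object, so nothing is allocated per measurement.
public sealed class ReusableMeasurementSource : IValueTaskSource<double>
{
    private ManualResetValueTaskSourceCore<double> _core;

    public ValueTask<double> Reset()
    {
        _core.Reset(); // bumps the version token, invalidating old ValueTasks
        return new ValueTask<double>(this, _core.Version);
    }

    public void SetResult(double nanoseconds) => _core.SetResult(nanoseconds);

    double IValueTaskSource<double>.GetResult(short token) => _core.GetResult(token);
    ValueTaskSourceStatus IValueTaskSource<double>.GetStatus(short token) => _core.GetStatus(token);
    void IValueTaskSource<double>.OnCompleted(Action<object> continuation, object state,
        short token, ValueTaskSourceOnCompletedFlags flags)
        => _core.OnCompleted(continuation, state, token, flags);
}
```

The engine would await the `ValueTask<double>` from `Reset()`, and the measurer would call `SetResult(...)` when the timing completes; the source object itself is allocated once and reused.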
I guess the next question is: does the engine care what thread it's running on if it's async? Because if the engine is async and does `double nanoseconds = await workloadMeasurer.Measure(clock);`, it could continue on a background thread. I remember something about BDN trying to set the CPU thread priority, but I'm not sure if it matters.
It might actually be okay to keep the measurement in the engine, because the async state machine checking whether the `ValueTask` is already complete is just a null check, which is even cheaper than invoking a delegate. But if we move it to the measurer, we could eliminate both (depending on whether the measurer calls the benchmark method directly or through a delegate).
I started looking at this, and changing the engine itself to be async has a lot of repercussions across the entire library. The main issue isn't refactoring the internal engine components, but the public API surface. Currently, the public APIs are all synchronous, and supporting async operations in UI applications (like Unity) would require adding async APIs alongside the sync ones. Technically, this shouldn't matter for out-of-process toolchains, as the generated code runs in a separate process and can do whatever it wants; it's really only necessary for in-process toolchains.
Anyway, I'm going to go ahead with my earlier idea of just fixing `ValueTask`s to work properly in the synchronous engine, and we can revisit adding async engine support later.
#1941 is a stop-gap to get `ValueTask`s working again. It still uses the less efficient (more overhead) synchronous blocking for each benchmark method call. A more efficient (less overhead) version like @stephentoub mentioned can also be done, but the unroll factor should be forced to 1 for async benchmarks (unrolling awaits makes the state machine more complex and doesn't help performance, unlike sync unrolling). @adamsitnik Is there an easy way to force the unroll factor to 1 for async benchmarks?
I did some refactoring to move the measurement into the workload call and properly await the returned `(Value)Task`, and the results are interesting.
```csharp
[Benchmark]
public async Task<int> AsyncYield()
{
    await Task.Yield();
    unchecked { ++counter; }
    return 42;
}

[Benchmark]
public Task<int> AsyncImmediate()
{
    unchecked { ++counter; }
    return Task.FromResult(42);
}

[Benchmark]
public int Sync()
{
    unchecked { ++counter; }
    return 42;
}
```

```csharp
var config = DefaultConfig.Instance.AddJob(Job.Default
    .WithUnrollFactor(1) // Forced to 1 because I haven't figured out how to force it in the lib yet.
    .AsDefault());
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);
```
Current:
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|---|---|---|---|---|---|
| AsyncYield | 1,853.1542 ns | 11.5610 ns | 10.8142 ns | 0.0381 | 120 B |
| AsyncImmediate | 20.5946 ns | 0.1705 ns | 0.1331 ns | 0.0229 | 72 B |
| Sync | 0.4096 ns | 0.0115 ns | 0.0108 ns | - | - |
After refactor:
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|---|---|---|---|---|---|
| AsyncYield | 1,472.9072 ns | 90.2642 ns | 264.7293 ns | 0.0381 | 120 B |
| AsyncImmediate | 23.4609 ns | 0.1313 ns | 0.1025 ns | 0.0229 | 72 B |
| Sync | 0.3818 ns | 0.0529 ns | 0.0741 ns | - | - |
Async completion shows less measured overhead, while sync completion shows more, which actually makes sense: that's how the `Task` would be consumed in real-world async scenarios using the `await` keyword (it's unusual to synchronously wait on a `Task`).
Measuring `ValueTask` instead of `Task`:
Current:
| Method | Mean | Error | StdDev | Median | Gen 0 | Allocated |
|---|---|---|---|---|---|---|
| AsyncYield | 1,896.6121 ns | 9.3127 ns | 8.7111 ns | 1,894.4635 ns | 0.0401 | 128 B |
| AsyncImmediate | 21.9522 ns | 0.0694 ns | 0.0649 ns | 21.9783 ns | - | - |
After refactor:
| Method | Mean | Error | StdDev | Median | Gen 0 | Allocated |
|---|---|---|---|---|---|---|
| AsyncYield | 1,610.8825 ns | 32.0146 ns | 29.9464 ns | 1,614.9253 ns | 0.0401 | 128 B |
| AsyncImmediate | 14.7426 ns | 0.0917 ns | 0.0858 ns | 14.7511 ns | - | - |
So, interestingly, only the `Task<int>` AsyncImmediate measures more overhead.