stride
stride copied to clipboard
[Core] Performance fixes in GameProfilingSystem and ChromeTracingProfileWriter
PR Details
Description
This is a continuation of #1953 to fix performance issues with the Profiler. As part of this work I ended up producing #2049.
From #1953
- Fix busy-wait in
GameProfilingSystemand make surrounding code clearer. - Add manual flushing to the
JsonWriter, so the internal buffer doesn't grow indefinitely and put it on the fast path by skipping validation.
Additionally:
- Refactor
GameProfilingSystemto extract all string formatting related logic into a private classGameProfilingSystemStrings - Modify a lot of formatting code to allocate less
- Interchange between two different tasks in the
GameProfilingSystemdepending on the filtering mode - Add cancellation tokens to the Profiler subscribers
- Remove semaphore lock from the Profiler in favor of ConcurrentDictionary with a flag checked using Interlocked
- Move from Unbounded channels to Bounded to reduce allocation and incredible GC pressure
- Add Task.Delay to consumption of events in GameProfilingSystem to further reduce allocation by waiting on channels
Related Issue
Closes #1953
Motivation and Context
The Profiler should be very efficient. After all if by measuring performance we decrease it, it's not a great system. These changes aim to improve mainly GC pressure by minimizing allocations. There's still further field for improvement that needs some deeper investigation - I was aiming to achieve perf parity between Fps profiler mode and CpuEvents -> it's close but there's still a bit of a difference. Furthermore there's still difference between measuring FPS without Profiling turned on and with.
Types of changes
- [x] Refactoring
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist
- [ ] My change requires a change to the documentation.
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
- [ ] I have built and run the editor to try this change out.
I don't think we have tests for the profiler, could you run those changes ? Also pinging @froce for review, let us know if you don't have time, I should have time to take a look in your stead.
I was running the game (Starbreach) with those changes to validate the allocations are reduced (using allocation tool in VS). Didn't mark the checkbox for the editor as I didn't invoke it, but will give it a try later.
I'm gonna try and create some BenchmarkDotNet samples to ensure my evidence wasn't incorrectly collected.
Benchmark description
I wanted to check there's a difference between bounded and unbounded channel for 1P, 1C scenario and what is the impact of including cancellation token on allocations.
Code
[MemoryDiagnoser]
[SimpleJob]
public class ProfilerBenchmark
{
private static ProfilingKey ProfilingKey = new ProfilingKey("Test");
private readonly Channel<ProfilingEvent> boundedChannel = Channel.CreateBounded<ProfilingEvent>(new BoundedChannelOptions(short.MaxValue)
{
SingleReader = true,
SingleWriter = true,
FullMode = BoundedChannelFullMode.DropWrite,
});
private readonly Channel<ProfilingEvent> unboundedChannel = Channel.CreateUnbounded<ProfilingEvent>(new UnboundedChannelOptions
{
SingleReader = true,
SingleWriter = true,
});
private readonly CancellationTokenSource CTS = new();
[Params(1000, 10_000)]
public int N { get; set; }
/// <summary>
/// Reads events from the stream, uses in a calculation and returns a result to ensure the compiler doesn't optimize out relevant parts.
/// </summary>
private async Task<TimeSpan> Consume(ChannelReader<ProfilingEvent> eventReader)
{
TimeSpan sum = default;
int i = 0;
// await foreach (var e in eventReader.ReadAllAsync()) // without cancellation token
await foreach (var e in eventReader.ReadAllAsync(CTS.Token))
{
sum += e.ElapsedTime;
if (++i == N) break;
}
return sum;
}
private void Produce(ChannelWriter<ProfilingEvent> writer)
{
for (int i = 0; i < N; i++)
{
writer.TryWrite(new ProfilingEvent(1, ProfilingKey, ProfilingMessageType.Begin, default, default, 1, null, default));
}
}
[Benchmark]
public async Task Bounded()
{
Task consumer = Consume(boundedChannel.Reader);
Produce(boundedChannel.Writer);
await consumer;
}
[Benchmark]
public async Task Unbounded()
{
Task consumer = Consume(unboundedChannel.Reader);
Produce(unboundedChannel.Writer);
await consumer;
}
}
Machine
BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3693/22H2/2022Update)
AMD Ryzen 5 2600X, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.100
[Host] : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
Job-LYCRLG : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
Runtime=.NET 8.0
Without cancellation token
| Method | Job | Toolchain | IterationCount | LaunchCount | WarmupCount | N | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| Bounded | DefaultJob | Default | Default | Default | Default | 1000 | 459.7 us | 20.31 us | 59.88 us | 822 B |
| Unbounded | DefaultJob | Default | Default | Default | Default | 1000 | 128.9 us | 2.47 us | 2.31 us | 838 B |
| Bounded | DefaultJob | Default | Default | Default | Default | 10000 | 4,797.2 us | 230.61 us | 679.96 us | 802 B |
| Unbounded | DefaultJob | Default | Default | Default | Default | 10000 | 1,119.2 us | 9.72 us | 8.62 us | 841 B |
With cancellation token
| Method | Job | IterationCount | LaunchCount | WarmupCount | N | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Bounded | DefaultJob | Default | Default | Default | 1000 | 418.9 us | 15.79 us | 46.56 us | 1050 B |
| Unbounded | DefaultJob | Default | Default | Default | 1000 | 126.1 us | 2.48 us | 3.48 us | 941 B |
| Bounded | DefaultJob | Default | Default | Default | 10000 | 4,949.8 us | 318.49 us | 939.09 us | 8028 B |
| Unbounded | DefaultJob | Default | Default | Default | 10000 | 1,141.3 us | 16.94 us | 15.85 us | 946 B |
Is this ready for merge ?
No, it needs further work. Might need to convert it to draft for now