stride icon indicating copy to clipboard operation
stride copied to clipboard

[Core] Performance fixes in GameProfilingSystem and ChromeTracingProfileWriter

Open manio143 opened this issue 2 years ago • 6 comments
trafficstars

PR Details

Description

This is a continuation of #1953 to fix performance issues with the Profiler. As part of this work I ended up producing #2049.

From #1953

  • Fix busy-wait in GameProfilingSystem and make surrounding code clearer.
  • Add manual flushing to the JsonWriter, so the internal buffer doesn't grow indefinitely and put it on the fast path by skipping validation.

Additionally:

  • Refactor GameProfilingSystem to extract all string formatting related logic into a private class GameProfilingSystemStrings
  • Modify a lot of formatting code to allocate less
  • Interchange between two different tasks in the GameProfilingSystem depending on the filtering mode
  • Add cancellation tokens to the Profiler subscribers
  • Remove semaphore lock from the Profiler in favor of ConcurrentDictionary with a flag checked using Interlocked
  • Move from Unbounded channels to Bounded to reduce allocation and incredible GC pressure
  • Add Task.Delay to consumption of events in GameProfilingSystem to further reduce allocation by waiting on channels

Related Issue

Closes #1953

Motivation and Context

The Profiler should be very efficient. After all if by measuring performance we decrease it, it's not a great system. These changes aim to improve mainly GC pressure by minimizing allocations. There's still further field for improvement that needs some deeper investigation - I was aiming to achieve perf parity between Fps profiler mode and CpuEvents -> it's close but there's still a bit of a difference. Furthermore there's still difference between measuring FPS without Profiling turned on and with.

Types of changes

  • [x] Refactoring
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • [ ] My change requires a change to the documentation.
  • [ ] I have added tests to cover my changes.
  • [ ] All new and existing tests passed.
  • [ ] I have built and run the editor to try this change out.

manio143 avatar Nov 19 '23 22:11 manio143

I don't think we have tests for the profiler, could you run those changes ? Also pinging @froce for review, let us know if you don't have time, I should have time to take a look in your stead.

Eideren avatar Nov 23 '23 02:11 Eideren

I was running the game (Starbreach) with those changes to validate the allocations are reduced (using allocation tool in VS). Didn't mark the checkbox for the editor as I didn't invoke it, but will give it a try later.

manio143 avatar Nov 23 '23 19:11 manio143

I'm gonna try and create some BenchmarkDotNet samples to ensure my evidence wasn't incorrectly collected.

manio143 avatar Nov 28 '23 17:11 manio143

Benchmark description

I wanted to check there's a difference between bounded and unbounded channel for 1P, 1C scenario and what is the impact of including cancellation token on allocations.

Code
[MemoryDiagnoser]
[SimpleJob]
public class ProfilerBenchmark
{
    private static ProfilingKey ProfilingKey = new ProfilingKey("Test");

    private readonly Channel<ProfilingEvent> boundedChannel = Channel.CreateBounded<ProfilingEvent>(new BoundedChannelOptions(short.MaxValue)
    {
        SingleReader = true,
        SingleWriter = true,
        FullMode = BoundedChannelFullMode.DropWrite,
    });

    private readonly Channel<ProfilingEvent> unboundedChannel = Channel.CreateUnbounded<ProfilingEvent>(new UnboundedChannelOptions
    {
        SingleReader = true,
        SingleWriter = true,
    });

    private readonly CancellationTokenSource CTS = new();

    [Params(1000, 10_000)]
    public int N { get; set; }

    /// <summary>
    /// Reads events from the stream, uses in a calculation and returns a result to ensure the compiler doesn't optimize out relevant parts.
    /// </summary>
    private async Task<TimeSpan> Consume(ChannelReader<ProfilingEvent> eventReader)
    {
        TimeSpan sum = default;
        int i = 0;
        // await foreach (var e in eventReader.ReadAllAsync()) // without cancellation token
        await foreach (var e in eventReader.ReadAllAsync(CTS.Token))
        {
            sum += e.ElapsedTime;
            if (++i == N) break;
        }
        return sum;
    }

    private void Produce(ChannelWriter<ProfilingEvent> writer)
    {
        for (int i = 0; i < N; i++)
        {
            writer.TryWrite(new ProfilingEvent(1, ProfilingKey, ProfilingMessageType.Begin, default, default, 1, null, default));
        }
    }

    [Benchmark]
    public async Task Bounded()
    {
        Task consumer = Consume(boundedChannel.Reader);
        Produce(boundedChannel.Writer);
        await consumer;
    }

    [Benchmark]
    public async Task Unbounded()
    {
        Task consumer = Consume(unboundedChannel.Reader);
        Produce(unboundedChannel.Writer);
        await consumer;
    }
}

Machine

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3693/22H2/2022Update)
AMD Ryzen 5 2600X, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  Job-LYCRLG : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

Runtime=.NET 8.0

Without cancellation token

Method Job Toolchain IterationCount LaunchCount WarmupCount N Mean Error StdDev Allocated
Bounded DefaultJob Default Default Default Default 1000 459.7 us 20.31 us 59.88 us 822 B
Unbounded DefaultJob Default Default Default Default 1000 128.9 us 2.47 us 2.31 us 838 B
Bounded DefaultJob Default Default Default Default 10000 4,797.2 us 230.61 us 679.96 us 802 B
Unbounded DefaultJob Default Default Default Default 10000 1,119.2 us 9.72 us 8.62 us 841 B

With cancellation token

Method Job IterationCount LaunchCount WarmupCount N Mean Error StdDev Allocated
Bounded DefaultJob Default Default Default 1000 418.9 us 15.79 us 46.56 us 1050 B
Unbounded DefaultJob Default Default Default 1000 126.1 us 2.48 us 3.48 us 941 B
Bounded DefaultJob Default Default Default 10000 4,949.8 us 318.49 us 939.09 us 8028 B
Unbounded DefaultJob Default Default Default 10000 1,141.3 us 16.94 us 15.85 us 946 B

manio143 avatar Dec 03 '23 20:12 manio143

Is this ready for merge ?

Eideren avatar Jan 07 '24 18:01 Eideren

No, it needs further work. Might need to convert it to draft for now

manio143 avatar Jan 07 '24 22:01 manio143