osu-framework icon indicating copy to clipboard operation
osu-framework copied to clipboard

Remove local vertex storage from VertexBuffer

Open smoogipoo opened this issue 3 years ago • 22 comments

This is an RFC. I will rebase and clean up this PR so please don't attempt to go through by-commit

@peppy @bdach @frenzibyte This PR is in a "final" state for me, but I want your opinions and approvals on the structure of things and my proposed ideas before I go ahead with rebasing and cleaning it up. I've attempted to explain things (minus making a breaking changes section) so let me know if something doesn't make sense.

Some things like the condition added in PathDrawNode can/will be split out into separate PRs.

Closes https://github.com/ppy/osu-framework/issues/4228


Abstract

VertexBuffer currently houses CPU-side storage for all added vertices in order to de-dupe vertex uploads. This is inefficient in multiple regards:

  • Permanently increases memory footprint.
  • Possibly causes stutters as a result of the ImageSharp array pool allocating finalisable objects.
  • Increases processing time for vertices. Every vertex addition, new or not, compares equality.

But we have a mechanism to avoid this - Drawable invalidation. We know exactly when Drawables change and when DrawNodes are expected to draw new vertices as a result, which can be used to relax our requirement for vertex equality and remove the CPU-side storage.

Batching

Batching requires constant maintenance to improve, and our current batching mechanisms are starting to show their age. There is a lot that can be done to improve batching, such as having an allocator with more knowledge over the structure of the scene graph to decide which vertices to batch together.

DrawNode should eventually not be able to create their own batches. Any methods of creating custom batches should be seen as temporary pending further implementation of such an allocator, at which point they'll be replaced by a lower-level solution.

IRenderer

To cover the above changes to batching and future work in eventually removing GLWrapper entirety, the new IRenderer becomes the single new parameter of DrawNode.Draw():

+ public interface IRenderer
+ {
+     // Will eventually replace GLWrapper.
+ }
+ 
+ public class OpenGLRenderer : IRenderer
+ {
+ }
  
  public class DrawNode
  {
-     public virtual void Draw(Action<TexturedVertex2D> vertexAction)
+     public virtual void Draw(IRenderer renderer)
      {
          // Actual drawing commands.
      }
  }

For the time being, IRenderer.BeginQuads() is exposed to interact with the parenting CompositeDrawable's/global quad batch.

VertexGroup<T>/VertexGroupUsage<T>

VertexGroup<T> is the new object by which a DrawNode "batches" vertices together. It stores metadata about vertices such as invalidation state, but does not store the vertices themselves.

VertexGroupUsage<T> denotes a usage of a VertexGroup<T>, allowing the user to "add" vertices to the group. This is a stack-only object and thus cannot be stored. This object internally optimises for the case where new vertices don't need to be uploaded.

Drawing

Combining all the above, a DrawNode undergoes the following process to draw to the screen:

public class CustomDrawNode : DrawNode
{
    // Create one or more VertexGroups to draw.
    private readonly VertexGroup<TexturedVertex2D> vertices = new VertexGroup<TexturedVertex2D>();

    public override void Draw(IRenderer renderer)
    {
        // Set up blending.
        base.Draw(renderer);

        // Setup shaders, textures, masking, etc.
        shader.Bind();
        texture.Bind();

        // Start a new usage of the VertexGroup.
        using (VertexGroupUsage<TexturedVertex2D> usage = renderer.BeginQuads(this, vertices))
        {
            // When it is known how many vertices are to be drawn ahead of time,
            // preventing construction of vertices or other superfluous operations
            // will yield small performance gains.
            //
            // This can be applied on a case-by-case basis where it is easy to do so,
            // otherwise Add() internally optimises for individual vertex additions.
            if (!usage.TrySkip(4))
                return;

            usage.Add(...);
            usage.Add(...);
            usage.Add(...);
            usage.Add(...);
        }
    }
}

Important rules

  • A DrawNode must not change its vertices between frames unless the Drawable has been invalidated.
    • The new vertex values will not be rendered until the Drawable is invalidated.
    • The framework will not throw an exception if this contract is violated except if compiled in DEBUG.
  • A VertexGroup<T> can not be used multiple times in a single frame.
    • The framework will throw an exception if this contract is violated.
  • VertexGroupUsage<T>s can not be nested.
    • The framework will throw an exception if this contract is violated.

Custom batches

The IRenderer only provides a method to draw quads. For the time being, DrawNodes can create their own vertex batches if they need to and use them much like IRenderer.BeginQuads():

public class CustomDrawNode : DrawNode
{
    private readonly LinearBatch<TexturedVertex2D> batch = new LinearBatch<TexturedVertex2D>(..., PrimitiveType.Triangles);
    private readonly VertexGroup<TexturedVertex2D> vertices = new VertexGroup<TexturedVertex2D>();

    public override void Draw(IRenderer renderer)
    {
        base.Draw(renderer);

        shader.Bind();
        texture.Bind();

        using (var usage = batch.BeginUsage(this, vertices))
        {
            // As above.
        }
    }
}

Eventually, vertex batches will be removed from DrawNodes altogether and instead go via IRenderer.

Vertex redirection

TextureGL and the helper methods of DrawNode such as DrawNode.DrawQuad() will only output TexturedVertex2D-typed vertices. For custom shaders which only accept a subset of properties, VertexGroup provides a redirection mechanism:

public class CustomDrawNode : DrawNode
{
    // A VertexGroup which converts from TexturedVertex2D to CustomVertex.
    private readonly VertexGroup<TexturedVertex2D, CustomVertex> vertices = new VertexGroup<TexturedVertex2D, CustomVertex>(v => new CustomVertex { ... });

    public override Draw(IRenderer renderer)
    {
        // ...

        // As per normal:
        using (var usage = renderer.BeginQuads(this, vertices))
        {
            // ...
        }
    }
}

Testing

When compiled in DEBUG, osu!framework stores all vertices and throws exceptions upon incorrect usages of VertexGroup. These include:

  • Vertices changing between frames without an invalidation. Fatal (should be avoided at all cost).
  • Vertex equality comparer being broken. Non-fatal (only affects the results of the above check).
  • I don’t expect users to compile their own osu!framework, but I don’t expect these asserts to be hit unless the user violates the single drawing contract which osu!framework can't verify in RELEASE mode. Regardless, future work in removing vertex batches from DrawNode will allow us to define a different allocator at runtime instead.

Performance

  • Consistently seen a >25% reduction in frametime in semi-complex scenes (SongSelect, SongSelect+BeatmapListing, SongSelect+BeatmapListing+BeatmapOverlay). Extremely complex scenes (TestSceneMultiSpectator with 16 players) become fully update-bound.
  • I've seen 70MB reduction in total allocs during a play session, as well a reduction in steady-state RAM usage. A VRAM usage counter has been added to the native section in global statistics.
  • The only allocation remaining is a static 1024-vertex array used to buffer vertex uploads.

smoogipoo avatar Apr 18 '22 10:04 smoogipoo

Will it still be possible to cache data in a VBO to then be manipulated with a projection matrix instead of regenerating it when the ScreenSpaceQuad changes? For this to be possible I usually used a VertexBatch (which was kept alive and therefore its memory not reclaimed) but it sounds like they are going to go away at some point...

Flutterish avatar Apr 18 '22 18:04 Flutterish

A VertexGroupUsage.TryReuse(int count) method could be exposed which forgoes uploads if only the invalidation state has changed but not if anything else has changed like the target batch/insertion index/etc. Semantics would be the same as the existing TrySkip(int count).

Would that cover your use case?

smoogipoo avatar Apr 18 '22 19:04 smoogipoo

A VertexGroupUsage.TryReuse(int count) method could be exposed which forgoes uploads if only the invalidation state has changed but not if anything else has changed like the target batch/insertion index/etc. Semantics would be the same as the existing TrySkip(int count).

Would that cover your use case?

I'm not sure. The exact use case is more or less this:

// tiles is a collection of thousands of tiles which are to be merged into a single batch, 1 quad per
if ( tilesInvalidated ) {
    // this function will create and/or reuse quad vertex buffers to effectively group tiles into batches 
    // that use the same texture (or other uniforms) and generate quads for them
    // these buffers are also shared between all draw nodes (shared data, to avoid duplication because of the triple buffer)
    regenerateBuffers(); 
    tilesInvalidated = false;
}
// this is a quirk with o!f that makes me need to call UpdateRange(0,0) on all vertex buffers 
// because otherwise IVertexBuffer.Free is called and deletes both the cpu and gpu buffer
// this is because it expects for me to write to the buffer each frame, but I do it only once
// and then only draw from it
keepBuffersAlive();
shader.Bind();
// the matrix is generated off of the drawable Matrix3
shader.GetUniform<Matrix4>( "projMatrix" ).UpdateValue( ref matrix );
// this loops through all the texture -> vertex buffer entries and draws their whole content with VertexBuffer.Draw
drawBatches();
shader.Unbind();

It's not important for the vertices to be stored on the cpu, but the references to GL buffers and their lengths need to, which also means no other drawable can override these GL buffers. It is also important to be able to use several buffers, as switching between uniforms several thousand times versus just a few times is quite costly. Additionally I absolutely need to have control over when/if these buffers are cleared, as clearing them too often can lead to framerate change from 1.2k (never cleared) to about 10 (cleared every frame).

Flutterish avatar Apr 19 '22 00:04 Flutterish

Before I dig too deep, I was looking at the performance changes from this PR and couldn't find a test scene which shows any perceivable frame time difference. Is there a specific visual tests framework-side which can be used (a worst case example I guess?) or would this be feasible to add if not currently present?

peppy avatar Apr 19 '22 02:04 peppy

Also opening Ctrl+F3 seems to crash:

[runtime] 2022-04-19 02:28:36 [error]: An unhandled error has occurred.
[runtime] 2022-04-19 02:28:36 [error]: NUnit.Framework.AssertionException
[runtime] 2022-04-19 02:28:36 [error]: at osu.Framework.Logging.ThrowingTraceListener.Fail(String message) in /Users/dean/Projects/osu-framework/osu.Framework/Logging/ThrowingTraceListener.cs:line 23
[runtime] 2022-04-19 02:28:36 [error]: at System.Diagnostics.TraceInternal.Fail(String message)
[runtime] 2022-04-19 02:28:36 [error]: at System.Diagnostics.Trace.Assert(Boolean condition)
[runtime] 2022-04-19 02:28:36 [error]: at osu.Framework.Graphics.OpenGL.GLWrapper.Reset(Vector2 size, Int32 treeIndex) in /Users/dean/Projects/osu-framework/osu.Framework/Graphics/OpenGL/GLWrapper.cs:line 154
[runtime] 2022-04-19 02:28:36 [error]: at osu.Framework.Platform.GameHost.DrawFrame() in /Users/dean/Projects/osu-framework/osu.Framework/Platform/GameHost.cs:line 499
[runtime] 2022-04-19 02:28:36 [error]: at osu.Framework.Threading.GameThread.processFrame() in /Users/dean/Projects/osu-framework/osu.Framework/Threading/GameThread.cs:line 450
System.InvalidOperationException: Vertex addition was skipped, but the contained vertex differs.
   at osu.Framework.Graphics.Batches.VertexBatch`1.osu.Framework.Graphics.Batches.Internal.IVertexBatch.EnsureCurrentVertex[TVertex](IVertexGroup vertices, TVertex vertex, String failureMessage) in /Users/dean/Projects/osu-framework/osu.Framework/Graphics/Batches/VertexBatch.cs:line 110
   at osu.Framework.Graphics.Batches.VertexGroupUsage`1.Add(TInput vertex) in /Users/dean/Projects/osu-framework/osu.Framework/Graphics/Batches/VertexGroupUsage.cs:line 49
   at osu.Framework.Graphics.OpenGL.Textures.TextureGLSingle.DrawQuad(VertexGroupUsage`1& vertices, Quad vertexQuad, ColourInfo drawColour, Nullable`1 textureRect, Nullable`1 inflationPercentage, Nullable`1 blendRangeOverride, Nullable`1 textureCoords) in /Users/dean/Projects/osu-framework/osu.Framework/Graphics/OpenGL/Textures/TextureGLSingle.cs:line 322
   at osu.Framework.Graphics.OpenGL.Textures.TextureGLSub.DrawQuad(VertexGroupUsage`1& vertices, Quad vertexQuad, ColourInfo drawColour, Nullable`1 textureRect, Nullable`1 inflationPercentage, Nullable`1 blendRangeOverride, Nullable`1 textureCoords) in /Users/dean/Projects/osu-framework/osu.Framework/Graphics/OpenGL/Textures/TextureGLSub.cs:line 84
   at osu.Framework.Graphics.Textures.Texture.DrawQuad(VertexGroupUsage`1& vertices, Quad vertexQuad, ColourInfo drawColour, Nullable`1 textureRect, Nullable`1 inflationPercentage, Nullable`1 blendRangeOverride, Nullable`1 textureCoords) in /Users/dean/Projects/osu-framework/osu.Framework/Graphics/Textures/Texture.cs:line 183
   at osu.Framework.Graphics.Visualisation.TextureVisualiser.UsageBackground.UsageBackgroundDrawNode.Blit(VertexGroupUsage`1& usage) in /Users/dean/Projects/osu-framework/osu.Framework/Graphics/Visualisation/TextureVisualiser.cs:line 238
   at osu.Framework.Graphics.Sprites.SpriteDrawNode.Draw(IRenderer renderer) in /Users/dean/Projects/osu-framework/osu.Framework/Graphics/Sprites/SpriteDrawNode.cs:line 92

peppy avatar Apr 19 '22 02:04 peppy

Is there a specific visual tests framework-side which can be used

I suppose you could use TestSceneFrontToBack, for me it goes from 350 -> 450 fps.

Also opening Ctrl+F3 seems to crash

Fixed. That DrawNode's pretty dodgy.

smoogipoo avatar Apr 19 '22 06:04 smoogipoo

As an aside, I've gone through the visual test browser, and noticed TestSceneEasingCurves/TestSceneKeyBindingsGrid running quite slow compared to master:

Can you profile this one and paste the hottest paths? I can't repro - that test scene goes from 800 -> 1100 fps for me.

smoogipoo avatar Apr 20 '22 03:04 smoogipoo

From a 10 seconds record as I was entering the regressed test scene:

CleanShot 2022-04-20 at 07 04 56@2x

frenzibyte avatar Apr 20 '22 04:04 frenzibyte

Sticking to TestSceneEasingCurves for now, how many vertices does it upload per frame? GlobalStatistics shows me around 500 vertices per frame: image

smoogipoo avatar Apr 20 '22 04:04 smoogipoo

Draw thread for me (upload count looks similar):

this pr: JetBrains Rider 2022-04-20 at 04 26 52

master: JetBrains Rider 2022-04-20 at 04 31 49

(profiling time was not consistent, so only compare %s)

peppy avatar Apr 20 '22 04:04 peppy

VerticesDraw stays at a constant ~4500, while VerticesUpl spikes up from 500 to ~1100 at most (not sure if considerable).

https://user-images.githubusercontent.com/22781491/164150608-c2dde3d0-3cd4-44b4-aa7e-240ca5447dcb.mp4

frenzibyte avatar Apr 20 '22 04:04 frenzibyte

spikes up from 500 to ~1100 at most (not sure if considerable).

There's something more to this that I'll have to investigate. 1000 vertices is nothing at all since the game regularly uploads 10x more than that.

smoogipoo avatar Apr 20 '22 07:04 smoogipoo

From an initial investigation it seems macOS, and especially Apple Silicon, does really not like buffer orphaning after draw calls. It's something I didn't consider but should've - the standard case of you shouldn't modify a VBO that's already in flight.

I'll need to think more about the structure to figure out how to fix this.

smoogipoo avatar Apr 20 '22 09:04 smoogipoo

FWIW, I've managed to get Veldrid running on this branch https://github.com/ppy/osu-framework/compare/master...frenzibyte:veldrid-plus-vbo-invalidation (requires a local veldrid checkout at this branch), in case you are interested.

frenzibyte avatar Apr 20 '22 23:04 frenzibyte

@frenzibyte Regarding https://github.com/ppy/osu-framework/pull/5109#pullrequestreview-946404024, can you check again and see if my most recent commit fixes it?

It's not perfect, lacking tests, and a huge edge case isn't handled correctly, but should be enough to check performance for the time being.

smoogipoo avatar Jun 18 '22 09:06 smoogipoo

Need some more time to dwell on this one. The problem right now is changing of states that cause a flush, happening during a group's usage (textures, shaders, etc) when most of this made with the slight intention that that shouldn't be allowed without explicitly disallowing it.

Unfortunately, fixing this would mean inverting things like texture.DrawQuad() to not automatically bind internally, and require DrawNodes to bind manually.

smoogipoo avatar Jun 18 '22 16:06 smoogipoo

@frenzibyte Regarding #5109 (review), can you check again and see if my most recent commit fixes it?

It's not perfect, lacking tests, and a huge edge case isn't handled correctly, but should be enough to check performance for the time being.

It has brought performance back on par with master for TestSceneEasingCurves, but is still degraded a bit in TestSceneKeyBindingsGrid (PR shows 150fps while it runs at 250fps on master), for whatever that's worth. Great to see nonetheless.

frenzibyte avatar Jun 18 '22 22:06 frenzibyte

I want to do another review pass on this PR, but first confirming that it's in a final state @smoogipoo ?

Do you have anything to add regarding the performance regression in TestSceneKeyBindingsGrid mentioned above? For me on release 234 (master) vs 280 (this branch), on debug 130 (master) vs 104 (this branch). Which seems fine. Was @frenzibyte's above comment in debug mode?

peppy avatar Jun 20 '22 05:06 peppy

Was @frenzibyte's above comment in debug mode?

Apparently wasn't according to https://discord.com/channels/188630481301012481/589331078574112768/987972289716965427, however the takeaway is that behaviour is different on M1 (I believe) vs desktop. M1 really doesn't like VBOs being reused in one draw frame whereas AMD/NV seems to not mind it as much (probably to varying degrees, it's still incorrect to do).

Do you have anything to add regarding the performance regression in TestSceneKeyBindingsGrid mentioned above?

The above being said, I don't know why the difference in performance still remains to some degree.

but first confirming that it's in a final state

It's close enough, but I still have to refactor and re-PR because this one's kinda messy. The general idea is there, though, so you could review it from that perspective.

smoogipoo avatar Jun 20 '22 06:06 smoogipoo

but I still have to refactor and re-PR because this one's kinda messy.

Best to block pending that so no one gets too deep in reviewing here then?

peppy avatar Jun 20 '22 06:06 peppy

This PR is blocked and there's a message in the PR description about it, is that enough?

smoogipoo avatar Jun 20 '22 06:06 smoogipoo

Ah, missed that. Is fine.

peppy avatar Jun 20 '22 06:06 peppy