
Pack multiple meshes into vertex and index buffers.

pcwalton opened this issue 9 months ago

The underlying allocation algorithm is offset-allocator, which is a port of Sebastian Aaltonen's OffsetAllocator. It's a fast, simple, hard-real-time allocator in the two-level segregated fit family.
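
For reference, here's a rough sketch of how such a suballocator is driven. The method names follow the original OffsetAllocator API (new/allocate/free, with allocations carrying an offset), but the exact Rust signatures, generics, and integer widths below are assumptions; check the offset-allocator crate docs for the real ones.

```rust
// Illustrative only: exact signatures in the offset-allocator crate may differ.
use offset_allocator::Allocator;

fn demo() {
    // One allocator manages one slab of GPU memory (sizes in bytes here).
    let mut allocator = Allocator::new(32 * 1024 * 1024);

    // Suballocate space for two meshes' vertex data.
    let mesh_a = allocator.allocate(1_500_000).expect("slab full");
    let mesh_b = allocator.allocate(64_000).expect("slab full");

    // An allocation is essentially an offset into the shared slab; the caller
    // writes the mesh data at `offset..offset + size` in the slab's buffer.
    println!("mesh A at {}, mesh B at {}", mesh_a.offset, mesh_b.offset);

    // Freeing returns the range to the two-level segregated fit free lists,
    // where it can be merged with neighboring free ranges.
    allocator.free(mesh_a);
    allocator.free(mesh_b);
}
```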

Allocations are divided into two categories: regular and large. Regular allocations go into one of the shared slabs managed by an allocator. Large allocations get their own individual slabs. Due to platform limitations, on WebGL 2 all vertex buffers are considered large allocations that get their own slabs; however, index buffers can still be packed together. The slab size is 32 MB by default, but the developer can adjust it manually.
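
As a rough sketch of that categorization (the names and the exact threshold rule here are illustrative, not the PR's real heuristic):

```rust
/// Illustrative sketch only; the PR's actual heuristic may differ.
const DEFAULT_SLAB_SIZE: u64 = 32 * 1024 * 1024; // 32 MB, adjustable by the developer

enum MeshAllocationKind {
    /// Packed into a shared slab alongside other meshes.
    Regular,
    /// Gets a dedicated slab of its own.
    Large,
}

fn classify(data_size: u64, is_vertex_data: bool, is_webgl2: bool) -> MeshAllocationKind {
    // Platform limitation noted above: on WebGL 2, vertex buffers always get
    // their own slabs; index buffers can still be packed together.
    if is_webgl2 && is_vertex_data {
        return MeshAllocationKind::Large;
    }
    // Anything too big to share a slab gets its own.
    if data_size > DEFAULT_SLAB_SIZE {
        MeshAllocationKind::Large
    } else {
        MeshAllocationKind::Regular
    }
}
```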

The mesh bin key and compare data have been reworked so that the slab IDs are compared first. That way, meshes in the same vertex and index buffers tend to be drawn together. Note that this only works well for opaque meshes; transparent meshes must be sorted into draw order, so there's less opportunity for grouping.
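
A minimal illustration of why putting the slab IDs first works (the struct and fields are hypothetical; the real bin key carries more data):

```rust
/// Hypothetical, simplified bin key. With `Ord` derived, comparison is
/// lexicographic over the fields in declaration order, so sorting groups
/// meshes that share the same vertex and index slabs next to each other.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct OpaqueBinKey {
    vertex_slab: u32,
    index_slab: u32,
    pipeline_id: u32,
    material_id: u32,
}

fn order_draws(keys: &mut [OpaqueBinKey]) {
    // Slab IDs dominate the sort, so buffer rebinds are minimized when the
    // opaque pass walks the bins in order.
    keys.sort();
}
```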

The purpose of packing meshes together is to reduce the number of times vertex and index buffers have to be re-bound, which is expensive. In the future, we'd like to use multi-draw, which allows us to draw multiple meshes with a single draw call as long as they're in the same buffers. Thus, this patch paves the way toward multi-draw, and with it a GPU-driven pipeline.
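
To make the savings concrete, here's a hedged sketch of a render loop that only rebinds when the slab changes. `PackedMesh` and the surrounding setup are invented for illustration; only the `wgpu::RenderPass` calls are real API.

```rust
// Hypothetical types, not Bevy internals.
struct PackedMesh {
    vertex_slab: u32,
    index_slab: u32,
    base_vertex: i32,                  // where this mesh's vertices start in its slab
    index_range: std::ops::Range<u32>, // where its indices live in its slab
}

fn draw_all<'a>(
    pass: &mut wgpu::RenderPass<'a>,
    slab_buffers: &'a [wgpu::Buffer],
    meshes: &[PackedMesh], // already sorted so slab IDs are contiguous
) {
    let (mut bound_vb, mut bound_ib) = (u32::MAX, u32::MAX);
    for mesh in meshes {
        // Rebind only when the slab actually changes. With the bin-key sorting
        // above, this is what takes Bistro from ~3,600 switches down to 927.
        if mesh.vertex_slab != bound_vb {
            pass.set_vertex_buffer(0, slab_buffers[mesh.vertex_slab as usize].slice(..));
            bound_vb = mesh.vertex_slab;
        }
        if mesh.index_slab != bound_ib {
            pass.set_index_buffer(
                slab_buffers[mesh.index_slab as usize].slice(..),
                wgpu::IndexFormat::Uint32,
            );
            bound_ib = mesh.index_slab;
        }
        // `base_vertex` points the indices at the mesh's location in the slab.
        pass.draw_indexed(mesh.index_range.clone(), mesh.base_vertex, 0..1);
    }
}
```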

Even without multi-draw, this patch results in significant performance improvements. For me, the command submission time (i.e. GPU time plus driver and wgpu overhead) for Bistro goes from 4.07ms to 1.42ms without shadows (2.8x speedup); with shadows it goes from 6.91ms to 2.62ms (2.45x speedup). The number of vertex and index buffer switches in Bistro is reduced from approximately 3,600 to 927, with the vast majority of the remaining switches due to the transparent pass.

Bistro, without shadows. Yellow is this PR; red is main. [screenshot]

Bistro, with shadows. Yellow is this PR; red is main. [screenshot]


Changelog

Added

  • Multiple meshes can now be packed together into vertex and index buffers, which reduces state changes and provides performance improvements.

Migration Guide

  • The vertex and index data in GpuMesh are now GpuAllocations instead of Buffers, to facilitate packing multiple meshes into the same buffer. To fetch the buffer corresponding to a GpuAllocation, use the buffer() method on the new GpuAllocator resource. Note that the allocation may be located anywhere in the buffer; use the offset() method to determine its location. A sketch of the resulting pattern follows below.
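
A rough sketch of what that lookup might look like in user code (only the method names mentioned above come from this guide; the helper function, field name, integer cast, and exactly where offset() lives are assumptions):

```rust
// Hypothetical helper, for illustration only.
fn bind_mesh_vertices<'a>(
    pass: &mut wgpu::RenderPass<'a>,
    allocator: &'a GpuAllocator, // the new resource
    gpu_mesh: &GpuMesh,
) {
    // The mesh's vertex data is now a GpuAllocation rather than a Buffer.
    // (Field name shown as in the pre-existing GpuMesh; it may differ.)
    let allocation = &gpu_mesh.vertex_buffer;

    // Resolve the shared slab buffer and the allocation's position inside it.
    let buffer = allocator.buffer(allocation);
    let offset = allocation.offset() as wgpu::BufferAddress; // integer type may differ

    // Bind starting at the allocation's offset instead of assuming offset 0.
    pass.set_vertex_buffer(0, buffer.slice(offset..));
}
```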

pcwalton avatar May 03 '24 20:05 pcwalton

Awesome! I will try to read through and test this over the weekend.

Will this also give us a second perf improvement when order-independent-transparency lands? That should mean we can basically stop sorting, right?

NthTensor avatar May 04 '24 13:05 NthTensor

@NthTensor Yes, I would think so.

pcwalton avatar May 04 '24 22:05 pcwalton

[screenshots]

I can replicate the issues with 2d shapes. That's weird.

NthTensor avatar May 07 '24 14:05 NthTensor

Updated to main and fixed the 2D meshes problem. It was a simple mistake when porting the logic from 3D over: in the indexed path, for the draw_indexed method call I forgot to switch the base_vertex parameter from 0 to the actual location in the buffer.
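
For anyone chasing a similar symptom, the shape of the fix (illustrative, not the exact diff):

```rust
// Sketch of the corrected indexed draw: the base_vertex argument must be the
// mesh's real base vertex within the packed vertex buffer, not a hard-coded 0.
fn draw_packed(
    pass: &mut wgpu::RenderPass<'_>,
    index_range: std::ops::Range<u32>,
    base_vertex: i32,
) {
    // Buggy version: pass.draw_indexed(index_range, 0, 0..1);
    pass.draw_indexed(index_range, base_vertex, 0..1);
}
```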

pcwalton avatar May 17 '24 05:05 pcwalton

Comments addressed

pcwalton avatar May 21 '24 02:05 pcwalton

[2:32 PM] pcwalton: I think 0.15 would be best for that.
[2:32 PM] pcwalton: Specifically my concern is the size of the slabs: the heuristic hasn't been well tuned and might balloon memory usage in some cases. We won't know without testing.
[2:32 PM] pcwalton: It's the kind of thing we'll only know through a cycle of testing on main so I'd be uncomfortable with 0.14. Besides, @Griffin found that the savings are very situational.

Blocking until 0.14 is shipped.

alice-i-cecile avatar May 21 '24 18:05 alice-i-cecile

This is ready to go, but I think it would be best to wait until 0.15 and not merge for 0.14. The reason is that the memory usage heuristics haven't been well tuned yet. We'll only know what the best heuristics are through a cycle of testing.

pcwalton avatar May 21 '24 18:05 pcwalton

I would like to nominate this for the release notes. I think the performance gains are significant enough that users would enjoy reading about it. (Not to mention I've been watching this in the background ever since offset-allocator hit the front page of HN!)

BD103 avatar May 22 '24 21:05 BD103

Agreed :) In the future, feel free to just add the label yourselves: it's easy to make the editorial call to split or lump things during the final release notes process.

alice-i-cecile avatar May 22 '24 21:05 alice-i-cecile

I'm closing this because I've written it.

pcwalton avatar Jul 08 '24 17:07 pcwalton