Renderer optimization tracking issue
There are many ideas, branches, proofs of concept, PRs, and discussions around improving the performance of the main code paths, systems, and data structures for rendering entities with meshes and materials.
This is a tracking issue to give an overview of what has been considered, is known/has been tried, is almost ready but needs finishing off, needs review, or has been merged. It should help to sequence work and avoid forgetting things.
Optimizations for general usage
- [x] Consolidate `RenderAssets` and `RenderMaterials`
  - Problem:
    - `RenderMaterials` exists to work around limitations of the `RenderAssets` API
    - `RenderMaterials` is duplicated across 3D, 2D, UI, gizmos, anywhere there is a duplicate of the `Material` API abstraction
    - Making a change to 'render assets' involves modifying not only `RenderAssets` but also `RenderMaterials` and all its duplicates
  - Solution:
    - Make the generic type argument of `RenderAsset` be the target type (e.g. `GpuMesh` instead of `Mesh`). This removes the root cause that prevented reuse of `RenderAssets` for materials.
  - Status: #12827
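  - For illustration, a minimal sketch of the reworked shape under simplified signatures (the concrete types and the `u64` key here are stand-ins, not the real API):

    ```rust
    use std::collections::HashMap;

    /// GPU-side types implement this; `SourceAsset` is the main-world asset.
    trait RenderAsset: Sized {
        type SourceAsset;
        /// Build the GPU representation from the extracted source asset.
        fn prepare_asset(source: &Self::SourceAsset) -> Self;
    }

    /// One generic storage can now serve meshes *and* prepared materials.
    struct RenderAssets<A: RenderAsset> {
        prepared: HashMap<u64, A>, // keyed by a simplified asset id
    }

    struct Mesh { positions: Vec<[f32; 3]> }
    struct GpuMesh { vertex_count: u32 }

    impl RenderAsset for GpuMesh {
        type SourceAsset = Mesh;
        fn prepare_asset(source: &Mesh) -> GpuMesh {
            GpuMesh { vertex_count: source.positions.len() as u32 }
        }
    }
    ```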
- [ ] Split per-instance buffer dynamic offsets, and instance/batch ranges, out of `PhaseItem`
  - Problem:
    - Queuing, sorting, batching, and rendering all interact with `RenderPhase`, which contains a `Vec<PhaseItem>`
    - `PhaseItem` contains a `Range<u32>` (8 bytes) for the instance/batch range, and an `Option<NonMaxU32>` (4 bytes) for the dynamic offset when using a `BatchedUniformBuffer`. These take space in caches, and are more data to move around when sorting.
  - Solution:
    - Queuing and sorting don't know anything about the instance/batch ranges or dynamic offsets, because those are only calculated in `batch_and_prepare_render_phase`. This leads to the conclusion that batch ranges and dynamic offsets should be moved out of `PhaseItem`s.
    - They could be moved into a separate `Vec` or `EntityHashMap` in `RenderPhase`, or into separate components with similar data structures on the view, generic over the `PhaseItem` to allow one per phase per view. The latter enables easier parallelism through ECS queries (`Arc<Mutex<T>>` members in `RenderPhase` would solve this too), but is perhaps a bit more awkward.
  - Status: proof of concept branch by @superdump
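  - For illustration, one possible side-table layout (all names are stand-ins; a real version would likely be generic over the `PhaseItem` type):

    ```rust
    use std::ops::Range;

    // Sort keys stay small: no batch range or dynamic offset per item.
    struct OpaquePhaseItem {
        sort_key: u64,
        entity: u64, // stand-in for `Entity`
    }

    struct RenderPhaseSketch {
        items: Vec<OpaquePhaseItem>,
        // Indexed in lockstep with `items`; written during batching,
        // read during rendering, untouched by queuing and sorting.
        batch_ranges: Vec<Range<u32>>,
        dynamic_offsets: Vec<Option<u32>>,
    }
    ```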
- [ ] Split `MeshUniform` `GpuArrayBuffer` preparation from `batch_and_prepare_render_phase`
  - Problem:
    - `batch_and_prepare_render_phase` is a bottleneck
    - `batch_and_prepare_render_phase` prepares the per-instance data buffer because, when using `BatchedUniformBuffer` on WebGL2/where storage buffers are not available, batches can be broken by filling up a uniform buffer binding (16kB minimum guaranteed, 64kB on NVIDIA, can be more on others) such that a new dynamic offset binding has to be started.
  - Solution:
    - Implement a `BatchedUniformBuffer` dynamic offset calculator
    - Split preparation of the per-instance `GpuArrayBuffer` into a `prepare_render_phase` system that is separate from `batch_and_prepare_render_phase` and can be run in parallel with it. Rename `batch_and_prepare_render_phase` to `batch_render_phase`
  - Status: proof of concept branch by @superdump
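  - A sketch of what such a dynamic offset calculator could compute, assuming a fixed binding size (`binding_boundaries` is hypothetical, not from the branch):

    ```rust
    /// Predict which instance indices must start a new dynamic-offset
    /// binding, without actually writing the uniform buffer. Batching can
    /// then break batches at these boundaries while the buffer itself is
    /// prepared elsewhere, in parallel.
    fn binding_boundaries(instance_count: u32, instance_size: u32, max_binding_size: u32) -> Vec<u32> {
        let per_binding = max_binding_size / instance_size; // instances per binding
        (0..instance_count).step_by(per_binding as usize).collect()
    }

    // e.g. with the guaranteed 16kB binding and 64-byte instances, a new
    // dynamic offset must start every 256 instances: [0, 256, 512, ...]
    ```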
- [ ] Split pipeline specialization out of queue systems
  - Problem:
    - Queue systems do a lot of work to gather information from views, mesh assets, and material assets, including multiple lookups and calculations of pipeline specialization keys.
    - Queue systems are a bottleneck
    - The design of queue material mesh systems is such that we can re-specialize for every visible entity for every render phase for every view. This is maximum flexibility.
    - View configuration and mesh/material asset properties should only change rarely. In the vast majority of cases, view/mesh/material properties that impact pipeline specialization are configured once, at initialization/load time, and never again.
    - Compiling pipelines at runtime while a user is interacting with the application causes hitches and long frames. In real-world games and applications, pipelines should ideally be prepared ahead of time (not necessarily offline, but before the start of a scene/level/area of a scene) so that when they are being used, there are no hitches.
    - The specialized pipelines design is highly optimized for fast lookups, but it is also doing a lot of unnecessary work.
    - Even though the specialized pipelines design is highly optimized, a lot of information has to be looked up to prepare the pipeline specialization keys to be able to do the lookup.
  - Solution:
    - Instead of specializing pipelines per visible entity per render phase per view every frame, we can specialize only when something changes
    - In the main world in `PostUpdate`:
      - Detect whether anything impacting a view's contribution to the pipeline specialization key changes. If it does, mark it as a 'dirty' view that needs respecialization
      - Detect whether a mesh or material asset, or the handles to those on an entity, have changed; if so, mark that entity as 'dirty' and in need of respecialization
    - When extracting, if an entity needs respecialization, gather the data needed to do so. If an entity was already specialized, fetch its pipeline id
    - Maintain mappings from assets to entities using those assets, for later fast gathering of entities that need respecialization
    - In `PrepareAssets`:
      - Process `Asset` events to additionally identify entities that need respecialization (if they use that asset)
      - Re-specialize all pipelines that need it and then update all pipeline ids for affected entities
    - Queue then simply looks up the pipeline id for the entity and uses it. This removes the need for much of the mesh/material asset lookups, which is better for cache coherency in this very hot loop.
  - Status: proof of concept branch by @superdump
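  - A hedged sketch of the main-world dirty marking in `PostUpdate`, using `bevy_ecs` change detection (`NeedsSpecialization` and `MeshHandle` are illustrative stand-ins, not Bevy's actual types):

    ```rust
    use bevy_ecs::prelude::*;

    #[derive(Component)]
    struct NeedsSpecialization; // hypothetical 'dirty' marker

    #[derive(Component)]
    struct MeshHandle(u64); // stand-in for `Handle<Mesh>` on an entity

    /// Any entity whose mesh handle changed this frame is marked as needing
    /// respecialization; everything else keeps its cached pipeline id.
    fn mark_needs_specialization(
        mut commands: Commands,
        changed: Query<Entity, Changed<MeshHandle>>,
    ) {
        for entity in &changed {
            commands.entity(entity).insert(NeedsSpecialization);
        }
    }
    ```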
- [x] Support binned `RenderPhase`s for opaque (note that alpha mask is also opaque) passes (including opaque pre and main passes, and shadow passes)
  - Problem:
    - Queuing, sorting, and batching are bottlenecks
    - Opaque passes don't technically need to be absolutely ordered. Sorting from front to back enables graphics hardware to do early depth testing, so it doesn't need to fragment shade a fragment that is further away than a previously-shaded fragment, but this is an optional optimization and perhaps other approaches give larger wins.
  - Solution:
    - Opaque entities can be batched by various properties (pipeline, bind groups, dynamic offsets, etc.) and their order doesn't strictly matter
    - Create a bin key based on those properties, and bin opaque entities into groups based on the bin key, e.g. `HashMap<BinKey, Entity>`
    - Sort the bin keys
    - Batching is then just iterating the bin keys and modifying the batch range and dynamic offset of an entity, which can later be used to look up bindings and such for encoding the draw command
  - Status: #12453
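  - A minimal sketch of the binning idea (a `BTreeMap` keyed by a simplified `BinKey` makes "sort the bin keys" fall out of iteration order; real keys contain more state):

    ```rust
    use std::collections::BTreeMap;

    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    struct BinKey {
        pipeline_id: u32,
        bind_group_id: u32,
        // ...anything else that must match for entities to share a batch
    }

    type Entity = u64; // stand-in

    fn bin_entities(
        queued: impl Iterator<Item = (BinKey, Entity)>,
    ) -> BTreeMap<BinKey, Vec<Entity>> {
        let mut bins: BTreeMap<BinKey, Vec<Entity>> = BTreeMap::new();
        for (key, entity) in queued {
            // Unordered within a bin; opaque draws don't need strict order.
            bins.entry(key).or_default().push(entity);
        }
        bins
    }
    ```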
- [ ] Improve `MeshUniform` inverse matrix calculation performance
  - Problem:
    - Calculating inverse matrices is expensive, part of a bottleneck, and can be optimized further
  - Solution:
    - Try using `ultraviolet` or another similar 'wide' SIMD crate to enable calculating many matrix inverses in parallel, instead of using 'vertical' SIMD like `glam` does, calculating one at a time.
  - Status: Idea
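  - To illustrate the 'wide' (structure-of-arrays) layout with plain arrays (crates like `ultraviolet` provide real wide types; 2x2 matrices are used here for brevity, where `MeshUniform` needs larger inverses):

    ```rust
    const LANES: usize = 8;

    /// Eight 2x2 matrices stored lane-wise: m[row][col][lane], so the same
    /// arithmetic runs across eight instances at once when vectorized.
    struct Mat2x8 {
        m: [[[f32; LANES]; 2]; 2],
    }

    impl Mat2x8 {
        /// Invert all eight matrices with one pass of scalar-per-lane math;
        /// each op below maps to one SIMD instruction over the lanes.
        fn inverse(&self) -> Mat2x8 {
            let mut out = Mat2x8 { m: [[[0.0; LANES]; 2]; 2] };
            for lane in 0..LANES {
                let (a, b) = (self.m[0][0][lane], self.m[0][1][lane]);
                let (c, d) = (self.m[1][0][lane], self.m[1][1][lane]);
                let inv_det = 1.0 / (a * d - b * c);
                out.m[0][0][lane] = d * inv_det;
                out.m[0][1][lane] = -b * inv_det;
                out.m[1][0][lane] = -c * inv_det;
                out.m[1][1][lane] = a * inv_det;
            }
            out
        }
    }
    ```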
- [ ] Batch directly into a 'draw stream'
  - Problem:
    - Batching and encoding render passes (render graph node execution for pre/shadow/main passes) is slow and a bottleneck
    - Batching compares properties of drawable entities to see if they can be drawn together
    - Render pass draw command encoding uses `TrackedRenderPass` to keep track of draw state so that when draw commands (binding pipelines and buffers, updating dynamic offsets, etc.) are issued to the `TrackedRenderPass`, it can compare and see if anything changed, and if not, it can skip passing the call on to `wgpu`. This means information is being compared twice, both in batching and rendering.
    - Both batching and rendering have to look up all the information needed to either identify batchability or encode draw commands, so the lookups are being done twice too.
  - Solution:
    - Batch directly into a draw stream
    - A draw stream is a concept from Sebastian Aaltonen's work on the HypeHype renderer. It is basically a `Vec<u32>` with a protocol. The first `u32` is a bit field for a single draw command, containing bits that indicate, for example, whether a pipeline needs to be rebound, whether a bind group needs rebinding, whether there is an index/vertex buffer to be rebound, and the type of draw (indexed vs not, direct vs indirect, etc.). The `u32`s that follow contain the ids or information needed to encode that draw.
    - The result is that lookups happen once, when batching, and `TrackedRenderPass` is no longer needed to check whether something actually needs rebinding, because the draw stream contains exactly what needs to be done.
    - Another consequence is that draw functions are no longer what is executed during draw command encoding. They should probably instead move to the batching stage or before, if necessary. This part is not yet figured out.
  - Status: proof of concept branch by @superdump. Needs more design work to figure out what needs to be done with `DrawFunction`/`RenderCommand`.
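  - An illustrative encoding of one draw into such a stream (the bit layout and fields here are invented for illustration, not HypeHype's or the branch's actual protocol):

    ```rust
    const REBIND_PIPELINE: u32 = 1 << 0;
    const REBIND_BIND_GROUP0: u32 = 1 << 1;
    const INDEXED: u32 = 1 << 2;

    /// Push one draw: a header bit field saying which state changes follow,
    /// then only the ids that are actually needed. Decoding just walks the
    /// `Vec<u32>` and issues exactly these commands, with nothing left to
    /// re-check at encode time.
    fn encode_draw(
        stream: &mut Vec<u32>,
        new_pipeline: Option<u32>,
        new_bind_group0: Option<u32>,
        index_count: u32,
        first_index: u32,
    ) {
        let mut header = INDEXED;
        if new_pipeline.is_some() { header |= REBIND_PIPELINE; }
        if new_bind_group0.is_some() { header |= REBIND_BIND_GROUP0; }
        stream.push(header);
        if let Some(id) = new_pipeline { stream.push(id); }
        if let Some(id) = new_bind_group0 { stream.push(id); }
        stream.push(index_count);
        stream.push(first_index);
    }
    ```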
- [ ] Faster `MeshUniform` serialization
  - Problem:
    - `MeshUniform` serialization is a bottleneck. `encase` performance is part of the problem.
  - Solution:
    - Use manual padding and alignment, and `bytemuck`, to bypass `encase`
  - Status: proof of concept branch by @superdump
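  - A sketch of the manual-layout approach (the field set is illustrative; the point is matching the WGSL layout rules by hand so a slice can be cast to bytes in one go):

    ```rust
    use bytemuck::{Pod, Zeroable};

    #[repr(C)]
    #[derive(Clone, Copy, Pod, Zeroable)]
    struct MeshUniformSketch {
        world_from_local: [[f32; 4]; 4], // mat4x4<f32> in WGSL
        flags: u32,
        _pad: [u32; 3], // manual padding out to a 16-byte boundary
    }

    /// No per-field serialization: just reinterpret the slice as bytes.
    fn as_bytes(uniforms: &[MeshUniformSketch]) -> &[u8] {
        bytemuck::cast_slice(uniforms)
    }
    ```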
- [ ] Write directly to `wgpu` staging buffers, avoiding memory copies
  - Problem:
    - Bevy often has multiple copies of data such as per-instance data: sometimes a Rust-side `Vec<T: ShaderType>` that is serialized into a `Vec<u8>` using `encase`, which is then given to `wgpu`'s `Queue::write_buffer()` API. This results in making multiple copies, which costs performance
  - Solution:
    - Use `wgpu`'s `write_buffer_with()` API. This allows requesting a mutable slice into an internal `wgpu` staging buffer. Serializing data directly into this mutable slice then avoids lots of unnecessary copies.
  - Status:
    - #12489 for the `GpuArrayBuffer<MeshUniform>`
    - Idea - there will be many more call sites that will need modifying to use this pattern
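  - A minimal sketch of the pattern with `wgpu`'s `write_buffer_with()`:

    ```rust
    fn upload(queue: &wgpu::Queue, buffer: &wgpu::Buffer, data: &[u8]) {
        let size = wgpu::BufferSize::new(data.len() as u64).expect("non-empty upload");
        if let Some(mut view) = queue.write_buffer_with(buffer, 0, size) {
            // `view` is a mutable slice into wgpu's staging memory. In real
            // use, serialize directly into it; copying from a ready-made
            // slice, as done here, is only for brevity.
            view.copy_from_slice(data);
        }
    }
    ```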
- [ ] Parallelize `MeshUniform` buffer preparation across `RenderPhase`s
  - Problem:
    - `batch_and_prepare_render_phase` uses `ResMut<GpuArrayBuffer<MeshUniform>>`, which means these systems run serially: there is only one `GpuArrayBuffer<MeshUniform>` resource and only one system can mutate it at a time
  - Solutions:
    - Use a `GpuArrayBuffer<MeshUniform>` per phase per view
      - This adds quite a lot of code complexity. Both @james7132 and @superdump have tried implementing it and it gets messy.
    - Get writers into mutable slices of the single `GpuArrayBuffer`, per phase per view, and then prepare each phase for each view in parallel
  - Status: #12489
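  - A sketch of the second approach: pre-split the single backing buffer into disjoint per-phase slices, then fill them in parallel (plain scoped threads stand in for Bevy's task pools):

    ```rust
    /// Carve one buffer into disjoint mutable slices, one per phase/view.
    fn split_phases<'a>(mut buffer: &'a mut [u8], sizes: &[usize]) -> Vec<&'a mut [u8]> {
        let mut slices = Vec::new();
        for &size in sizes {
            let (head, tail) = std::mem::take(&mut buffer).split_at_mut(size);
            slices.push(head);
            buffer = tail;
        }
        slices
    }

    fn main() {
        let mut buffer = vec![0u8; 192];
        std::thread::scope(|scope| {
            for slice in split_phases(&mut buffer, &[64, 128]) {
                // Stand-in for serializing one phase's MeshUniforms.
                scope.spawn(move || slice.fill(1));
            }
        });
    }
    ```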
- [ ] Material data in arrays
  - Noted in #89
  - Problem:
    - Each material instance has its own bind group, due to material data being prepared into an individual uniform buffer for that instance in `AsBindGroup`
    - Entities using the same material type and same material textures cannot be batched if their material data is different, due to needing to bind a different bind group
  - Solution:
    - Modify `AsBindGroup` to write material data into a `GpuArrayBuffer` per material type
    - Add the material index into per-instance data
    - Modify shaders to look up material data using the index from the per-instance data
  - Status: proof of concept branch by @superdump
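  - Illustrative data shapes (names are stand-ins, not Bevy's actual types): one array per material type holds the material data, and each instance carries an index into it, so differing material data no longer forces a bind group change:

    ```rust
    /// Entry in the per-material-type array (`GpuArrayBuffer`-style storage).
    #[repr(C)]
    #[derive(Clone, Copy)]
    struct GpuStandardMaterialData {
        base_color: [f32; 4],
        perceptual_roughness: f32,
        metallic: f32,
        _pad: [f32; 2],
    }

    /// Per-instance data gains the index the shader uses to fetch material
    /// data, alongside the existing transform/mesh fields.
    #[repr(C)]
    #[derive(Clone, Copy)]
    struct PerInstanceData {
        material_index: u32,
    }
    ```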
- [ ] Use one set of large mesh buffers per vertex attribute layout
  - Noted in #89
  - Problem:
    - Index/vertex buffers have to be re-bound when the mesh changes. This adds overhead when encoding draws and when drawing. It also prevents some optimisations, like being able to draw all objects for shadow mapping for a light in one draw.
  - Solution:
    - Write all mesh data for meshes with the same vertex attributes into one large index buffer and one large vertex buffer
    - Use an appropriate allocator, like a port of Sebastian Aaltonen's offset allocator, to manage allocation
    - Manage index/vertex ranges for individual assets such that encoding a draw then only binds index/vertex buffers once and just gives the index/vertex range to the draw command.
  - Status: Idea
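  - Illustrative bookkeeping for the shared buffers (names are stand-ins): each mesh records the ranges it occupies, so draws bind the shared buffers once and pass only ranges per draw:

    ```rust
    use std::ops::Range;

    // Where one mesh asset lives inside the shared buffers for its vertex
    // attribute layout; managed by the allocator.
    struct MeshAllocation {
        vertex_range: Range<u32>, // within the shared vertex buffer
        index_range: Range<u32>,  // within the shared index buffer
    }

    // A draw then needs no per-mesh rebinding, just ranges, e.g. in wgpu:
    // pass.draw_indexed(alloc.index_range.clone(), base_vertex, instances);
    ```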
- [ ] Split large buffers of vertex attributes into one for position, one for everything else
  - Problem:
    - Cache miss rate in shadow passes is high due to most vertex attributes being completely irrelevant
  - Solution:
    - Split out position from other vertex attributes and put it into its own vertex buffer
    - This enables shadow passes for opaque entities to load only the position vertex buffer, getting a much higher cache hit rate and much better performance
  - Status: Idea
- [ ] Support texture atlasing in `StandardMaterial`
  - Problem:
    - Entities can't be batched if they have different textures
  - Solution:
    - Use texture atlases, possibly in array textures, to reduce the number of separate texture bindings that are needed
    - Add support for UV offset/scale etc. as needed to map from vertex UVs to atlas UVs
  - Status: Idea
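  - The UV remap itself is simple: a per-material offset/scale maps the mesh's 0..1 UVs into the atlas region. A sketch, written in Rust for illustration though the remap would live in shader code:

    ```rust
    /// Map a vertex UV in 0..1 into the sub-rectangle of the atlas that
    /// holds this material's texture.
    fn to_atlas_uv(uv: [f32; 2], offset: [f32; 2], scale: [f32; 2]) -> [f32; 2] {
        [offset[0] + uv[0] * scale[0], offset[1] + uv[1] * scale[1]]
    }
    ```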
- [ ] Support bindless texture arrays
  - Noted in #89
  - Problem:
    - Entities can't be batched because they have different textures, and texture atlasing requires preprocessing of image assets, which may not be possible or ideal to do at runtime given that one wants to use compressed GPU texture formats that are very slow to compress
  - Solution:
    - Support bindless texture arrays where all textures are bound in one big array
    - Store indices into bindless texture arrays in material data
  - Status: Idea
- [ ] Small generational indices for assets
  - Problem:
    - Queuing, sorting, batching, preparation, and rendering are bottlenecks
    - `AssetId` is quite large due to having a UUID variant (16 bytes), which means slower hashing and worse cache hit rates (larger data uses more space in caches).
    - Loads, stores, hashes, and comparisons of `AssetId`s are done in all those bottleneck systems, either directly, or just because of being part of `PhaseItem`s.
  - Solutions:
    - @cart 's asset changes, possibly including `Asset`s as `Entity`s
      - `Uuid` is removed!
    - Slotmap render assets
      - `RenderAssets` could instead use a `SlotMap<PreparedAsset>` and a `HashMap<AssetId, SlotMapKey>`.
      - A `SlotMapKey` is a `u32` + `u32` generational index, so 8 bytes.
      - On extraction, the `SlotMapKey` is looked up for the `AssetId`, or the `AssetId` is extracted to a separate queue of assets to be prepared and the `SlotMapKey` is later backfilled into the extracted data type after preparation is complete.
      - After extraction and asset preparation, the render app then exclusively uses the `SlotMapKey` to look up the `PreparedAsset`, which avoids hashing entirely, is more cache friendly, and means less data to be sorted.
      - Status: https://github.com/bevyengine/bevy/pull/13013 by @superdump
    - Asset indexify
      - Make `Handle` and `AssetId` into structs containing `AssetIndex` and `Option<Uuid>`
      - Make `const Handle` into `const Uuid` and maintain a `HashMap<Uuid, Handle<T>>`: create the handles at runtime, insert them into the map, and look them up from the map using the `const Uuid`
      - Status: https://github.com/superdump/bevy/tree/asset-indexify by @superdump, which had better performance than slotmap render assets, but won't be merged, in favour of waiting on @cart 's coming asset changes
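  - A minimal sketch of the slotmap variant using the `slotmap` crate (the `u128` asset-id stand-in is illustrative):

    ```rust
    use slotmap::{DefaultKey, SlotMap};
    use std::collections::HashMap;

    struct PreparedAsset; // stand-in for a GPU-ready mesh/material

    #[derive(Default)]
    struct RenderAssetsSketch {
        assets: SlotMap<DefaultKey, PreparedAsset>,
        key_by_id: HashMap<u128, DefaultKey>, // AssetId -> small 8-byte key
    }

    impl RenderAssetsSketch {
        /// Hash the AssetId once, at preparation/backfill time...
        fn prepare(&mut self, asset_id: u128, prepared: PreparedAsset) -> DefaultKey {
            let key = self.assets.insert(prepared);
            self.key_by_id.insert(asset_id, key);
            key
        }

        /// ...so hot render-app paths use the small copyable key, no hashing.
        fn get(&self, key: DefaultKey) -> Option<&PreparedAsset> {
            self.assets.get(key)
        }
    }
    ```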
Optimizations for specific use cases
- [ ] Improved directional light shadow cascade culling
  - Problem:
    - Culling is done based on the directional light shadow cascade's frustum, which covers a much larger volume than the camera frustum slice corresponding to the cascade. This means we prepare for shadow casting onto regions outside the camera frustum
  - Solution:
    - Cull to camera frustum slice bounds in light space
  - Status: proof of concept branch by @superdump with some bugs
- [ ] Use storage buffers for animation data to enable batching
  - Problem:
    - Animated meshes cannot be batched, resulting in many individual draws despite entities being otherwise batchable
  - Solution:
    - Aside from WebGL2, storage buffers can be used.
    - Write animated mesh data into a runtime-sized array in a storage buffer, and write the index into that array into per-instance data. This enables a single binding to be used for all animated entities, so they can be drawn in one batch if they meet the other requirements for batching. This gives a big performance boost.
  - Status: #10094
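  - An illustrative CPU-side shape (names invented): all joint matrices go into one runtime-sized storage-buffer array, and each instance records where its joints start:

    ```rust
    /// CPU staging for a WGSL `var<storage> joint_matrices: array<mat4x4<f32>>`.
    #[derive(Default)]
    struct SkinUniformsSketch {
        joint_matrices: Vec<[[f32; 4]; 4]>,
    }

    impl SkinUniformsSketch {
        /// Append one entity's joints; the returned offset goes into its
        /// per-instance data so the vertex shader can index the shared array.
        fn push_entity(&mut self, joints: &[[[f32; 4]; 4]]) -> u32 {
            let offset = self.joint_matrices.len() as u32;
            self.joint_matrices.extend_from_slice(joints);
            offset
        }
    }
    ```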
- [ ] Remove the built-in sprite renderer and use 2D material meshes
  - Problem:
    - Sprite rendering is very optimized, but is very hard-coded and inflexible. No custom materials.
    - 2D material mesh rendering has not been as fast as sprite rendering.
    - Sprite rendering can only use quads, which incur significant overdraw that is a limiting factor for performance in some cases.
  - Solution:
    - Use 2D material meshes for rendering sprites
    - The same components/assets, or simplified sprite-focused main world components/assets, can be used and translated into material meshes
    - Sprite rendering can then continue to benefit from all subsequent optimizations made to the material mesh rendering infrastructure
  - Status: Idea
Other things to add:
- [ ] Shadow atlasing for improved performance
  - This also allows users to set a fixed VRAM budget for shadows. Having a fixed budget means we also need to figure out how to gracefully and automatically degrade shadows based on distance (or some other metric) to fit everything into the atlas.
  - Caching this between frames and only updating shadow views when necessary (e.g. don't update a view if it was static) can be a nice performance win.
- [ ] Reduce our register pressure in shaders
  - This is something that we'll eternally be fighting against, especially as more features are added.
  - Currently our main_opaque_3d fragment shader has very high register pressure due to how many VGPRs we're using, which means we get poor throughput. Forward shading is expected to have high register pressure, but bevy's is currently extreme.
  - See https://gpuopen.com/learn/occupancy-explained/ for an explanation of GPU occupancy
  - See https://flashypixels.wordpress.com/2018/11/10/intro-to-gpu-scalarization-part-1/ and https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/ for some articles on how to improve occupancy.
  - RGA can be used, if you dump bevy's shaders to SPIR-V, to view what's actually causing the VGPR pressure that is in turn causing the low occupancy.
  - Nvidia Nsight and AMD RGP can show the register pressure in a capture. Intel GPA and Apple's Xcode GPU debugger can probably show this too. Unsure about Android.
- [ ] FMA all the things.
  - Afaik shader compilers won't reorder floating point ops as it changes precision/output, so we have to try to write shaders so that they generate FMA.
  - Basically, make sure to write stuff as `a * b + c` instead of `c + a * b`
Improve `MeshUniform` inverse matrix calculation performance

This may be supplanted on platforms where compute shaders are present by #12773.

Faster `MeshUniform` serialization

https://github.com/teoxoy/encase/pull/65 should make `encase` as fast as `bytemuck`-based approaches.

#12773 also effectively bypasses `encase` on platforms where compute shaders are present, since `MeshInputUniform` uses `bytemuck`.