Renderer optimization tracking issue
There are many ideas, branches, proofs of concept, PRs, and discussions around improving the performance of the main code paths, systems, and data structures for rendering entities with meshes and materials.
This is a tracking issue to give an overview of what has been considered, is known/has been tried, is almost ready but needs finishing off, needs review, or has been merged. It should help to sequence work and avoid forgetting things.
Optimizations for general usage
- [x] Consolidate `RenderAssets` and `RenderMaterials`
  - Problem:
    - `RenderMaterials` exists to work around limitations of the `RenderAssets` API
    - `RenderMaterials` is duplicated across 3D, 2D, UI, gizmos, anywhere there is a duplicate of the `Material` API abstraction
    - Making a change to 'render assets' involves modifying not only `RenderAssets` but also `RenderMaterials` and all its duplicates
  - Solution:
    - Make the generic type argument of `RenderAsset` be the target type (e.g. `GpuMesh` instead of `Mesh`). This removes the root cause that prevented reuse of `RenderAssets` for materials.
  - Status: #12827
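  - For illustration, a minimal sketch of the reworked shape under simplified signatures (the concrete types and the `u64` key here are stand-ins, not the real API):

    ```rust
    use std::collections::HashMap;

    /// GPU-side types implement this; `SourceAsset` is the main-world asset.
    trait RenderAsset: Sized {
        type SourceAsset;
        /// Build the GPU representation from the extracted source asset.
        fn prepare_asset(source: &Self::SourceAsset) -> Self;
    }

    /// One generic storage can now serve meshes *and* prepared materials.
    struct RenderAssets<A: RenderAsset> {
        prepared: HashMap<u64, A>, // keyed by a simplified asset id
    }

    struct Mesh { positions: Vec<[f32; 3]> }
    struct GpuMesh { vertex_count: u32 }

    impl RenderAsset for GpuMesh {
        type SourceAsset = Mesh;
        fn prepare_asset(source: &Mesh) -> GpuMesh {
            GpuMesh { vertex_count: source.positions.len() as u32 }
        }
    }
    ```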
- [ ] Split per-instance buffer dynamic offsets, and instance/batch ranges, out of `PhaseItem`
  - Problem:
    - Queuing, sorting, batching, and rendering all interact with `RenderPhase`, which contains a `Vec<PhaseItem>`
    - `PhaseItem` contains a `Range<u32>` (8 bytes) for the instance/batch range, and an `Option<NonMaxU32>` (4 bytes) for the dynamic offset when using a `BatchedUniformBuffer`. These take space in caches, and are more data to move around when sorting.
  - Solution:
    - Queuing and sorting don't know anything about the instance/batch ranges or dynamic offsets, because those are only calculated in `batch_and_prepare_render_phase`. This leads to the conclusion that batch ranges and dynamic offsets should be moved out of `PhaseItem`s.
    - They could be moved into a separate `Vec` or `EntityHashMap` in `RenderPhase`, or into separate components with similar data structures on the view, generic over the `PhaseItem` to allow one per phase per view. The latter enables easier parallelism through ECS queries (`Arc<Mutex<T>>` members in `RenderPhase` would solve this too), but is perhaps a bit more awkward.
  - Status: proof of concept branch by @superdump
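  - For illustration, one possible side-table layout (all names are stand-ins; a real version would likely be generic over the `PhaseItem` type):

    ```rust
    use std::ops::Range;

    // Sort keys stay small: no batch range or dynamic offset per item.
    struct OpaquePhaseItem {
        sort_key: u64,
        entity: u64, // stand-in for `Entity`
    }

    struct RenderPhaseSketch {
        items: Vec<OpaquePhaseItem>,
        // Indexed in lockstep with `items`; written during batching,
        // read during rendering, untouched by queuing and sorting.
        batch_ranges: Vec<Range<u32>>,
        dynamic_offsets: Vec<Option<u32>>,
    }
    ```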
- [ ] Split `MeshUniform` `GpuArrayBuffer` preparation from `batch_and_prepare_render_phase`
  - Problem:
    - `batch_and_prepare_render_phase` is a bottleneck
    - `batch_and_prepare_render_phase` prepares the per-instance data buffer because, when using `BatchedUniformBuffer` on WebGL2/where storage buffers are not available, batches can be broken by filling up a uniform buffer binding (16kB minimum guaranteed, 64kB on NVIDIA, can be more on others) such that a new dynamic offset binding has to be started.
  - Solution:
    - Implement a `BatchedUniformBuffer` dynamic offset calculator
    - Split preparation of the per-instance `GpuArrayBuffer` into a `prepare_render_phase` system that is separate from `batch_and_prepare_render_phase` and can be run in parallel with it. Rename `batch_and_prepare_render_phase` to `batch_render_phase`
  - Status: proof of concept branch by @superdump
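  - A sketch of what such a dynamic offset calculator could compute, assuming a fixed binding size (`binding_boundaries` is hypothetical, not from the branch):

    ```rust
    /// Predict which instance indices must start a new dynamic-offset
    /// binding, without actually writing the uniform buffer. Batching can
    /// then break batches at these boundaries while the buffer itself is
    /// prepared elsewhere, in parallel.
    fn binding_boundaries(instance_count: u32, instance_size: u32, max_binding_size: u32) -> Vec<u32> {
        let per_binding = max_binding_size / instance_size; // instances per binding
        (0..instance_count).step_by(per_binding as usize).collect()
    }

    // e.g. with the guaranteed 16kB binding and 64-byte instances, a new
    // dynamic offset must start every 256 instances: [0, 256, 512, ...]
    ```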
- [ ] Split pipeline specialization out of queue systems
  - Problem:
    - Queue systems do a lot of work to gather information from views, mesh assets, and material assets, including multiple lookups and calculations of pipeline specialization keys.
    - Queue systems are a bottleneck
    - The design of queue material mesh systems is such that we can re-specialize for every visible entity for every render phase for every view. This is maximum flexibility.
    - View configuration and mesh/material asset properties should only change rarely. In the vast majority of cases, view/mesh/material properties that impact pipeline specialization are configured once, at initialization/load time, and never again.
    - Compiling pipelines at runtime while a user is interacting with the application causes hitches and long frames. In real-world games and applications, pipelines should ideally be prepared ahead of time (not necessarily offline, but before the start of a scene/level/area of a scene) so that when they are being used, there are no hitches.
    - The specialized pipelines design is highly optimized for fast lookups, but it is also doing a lot of unnecessary work.
    - Even though the specialized pipelines design is highly optimized, a lot of information has to be looked up to prepare the pipeline specialization keys to be able to do the lookup.
  - Solution:
    - Instead of specializing pipelines per visible entity per render phase per view every frame, we can specialize only when something changes
    - In the main world in `PostUpdate`:
      - Detect whether anything impacting a view's contribution to the pipeline specialization key changes. If it does, mark it as a 'dirty' view that needs respecialization
      - Detect whether a mesh or material asset, or the handles to those on an entity, have changed; if so, mark that entity as 'dirty' and in need of respecialization
    - When extracting, if an entity needs respecialization, gather the data needed to do so. If an entity was already specialized, fetch its pipeline id
    - Maintain mappings from assets to entities using those assets, for later fast gathering of entities that need respecialization
    - In `PrepareAssets`:
      - Process `Asset` events to additionally identify entities that need respecialization (if they use that asset)
      - Re-specialize all pipelines that need it and then update all pipeline ids for affected entities
    - Queue then simply looks up the pipeline id for the entity and uses it. This removes the need for much of the mesh/material asset lookups, which is better for cache coherency in this very hot loop.
  - Status: proof of concept branch by @superdump
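  - A hedged sketch of the main-world dirty marking in `PostUpdate`, using `bevy_ecs` change detection (`NeedsSpecialization` and `MeshHandle` are illustrative stand-ins, not Bevy's actual types):

    ```rust
    use bevy_ecs::prelude::*;

    #[derive(Component)]
    struct NeedsSpecialization; // hypothetical 'dirty' marker

    #[derive(Component)]
    struct MeshHandle(u64); // stand-in for `Handle<Mesh>` on an entity

    /// Any entity whose mesh handle changed this frame is marked as needing
    /// respecialization; everything else keeps its cached pipeline id.
    fn mark_needs_specialization(
        mut commands: Commands,
        changed: Query<Entity, Changed<MeshHandle>>,
    ) {
        for entity in &changed {
            commands.entity(entity).insert(NeedsSpecialization);
        }
    }
    ```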
- [x] Support binned `RenderPhase`s for opaque (note that alpha mask is also opaque) passes (including opaque pre and main passes, and shadow passes)
  - Problem:
    - Queuing, sorting, and batching are bottlenecks
    - Opaque passes don't technically need to be absolutely ordered. Sorting from front to back enables graphics hardware to do early depth testing, so it doesn't need to fragment shade a fragment that is further away than a previously-shaded fragment, but this is an optional optimization and perhaps other approaches give larger wins.
  - Solution:
    - Opaque entities can be batched by various properties (pipeline, bind groups, dynamic offsets, etc.) and their order doesn't strictly matter
    - Create a bin key based on those properties, and bin opaque entities into groups based on the bin key, e.g. `HashMap<BinKey, Entity>`
    - Sort the bin keys
    - Batching is then just iterating the bin keys and modifying the batch range and dynamic offset of an entity, which can later be used to look up bindings and such for encoding the draw command
  - Status: #12453
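  - A minimal sketch of the binning idea (a `BTreeMap` keyed by a simplified `BinKey` makes "sort the bin keys" fall out of iteration order; real keys contain more state):

    ```rust
    use std::collections::BTreeMap;

    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    struct BinKey {
        pipeline_id: u32,
        bind_group_id: u32,
        // ...anything else that must match for entities to share a batch
    }

    type Entity = u64; // stand-in

    fn bin_entities(
        queued: impl Iterator<Item = (BinKey, Entity)>,
    ) -> BTreeMap<BinKey, Vec<Entity>> {
        let mut bins: BTreeMap<BinKey, Vec<Entity>> = BTreeMap::new();
        for (key, entity) in queued {
            // Unordered within a bin; opaque draws don't need strict order.
            bins.entry(key).or_default().push(entity);
        }
        bins
    }
    ```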
- [ ] Improve `MeshUniform` inverse matrix calculation performance
  - Problem:
    - Calculating inverse matrices is expensive, part of a bottleneck, and can be optimized further
  - Solution:
    - Try using `ultraviolet` or another similar 'wide' SIMD crate to enable calculating many matrix inverses in parallel, instead of using 'vertical' SIMD like `glam` does, calculating one at a time.
  - Status: Idea
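  - To illustrate the 'wide' (structure-of-arrays) layout with plain arrays (crates like `ultraviolet` provide real wide types; 2x2 matrices are used here for brevity, where `MeshUniform` needs larger inverses):

    ```rust
    const LANES: usize = 8;

    /// Eight 2x2 matrices stored lane-wise: m[row][col][lane], so the same
    /// arithmetic runs across eight instances at once when vectorized.
    struct Mat2x8 {
        m: [[[f32; LANES]; 2]; 2],
    }

    impl Mat2x8 {
        /// Invert all eight matrices with one pass of scalar-per-lane math;
        /// each op below maps to one SIMD instruction over the lanes.
        fn inverse(&self) -> Mat2x8 {
            let mut out = Mat2x8 { m: [[[0.0; LANES]; 2]; 2] };
            for lane in 0..LANES {
                let (a, b) = (self.m[0][0][lane], self.m[0][1][lane]);
                let (c, d) = (self.m[1][0][lane], self.m[1][1][lane]);
                let inv_det = 1.0 / (a * d - b * c);
                out.m[0][0][lane] = d * inv_det;
                out.m[0][1][lane] = -b * inv_det;
                out.m[1][0][lane] = -c * inv_det;
                out.m[1][1][lane] = a * inv_det;
            }
            out
        }
    }
    ```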
- [ ] Batch directly into a 'draw stream'
  - Problem:
    - Batching and encoding render passes (render graph node execution for pre/shadow/main passes) is slow and a bottleneck
    - Batching compares properties of drawable entities to see if they can be drawn together
    - Render pass draw command encoding uses `TrackedRenderPass` to keep track of draw state so that when draw commands (binding pipelines and buffers, updating dynamic offsets, etc.) are issued to the `TrackedRenderPass`, it can compare and see if anything changed, and if not, it can skip passing the call on to `wgpu`. This means information is being compared twice, both in batching and rendering.
    - Both batching and rendering have to look up all the information needed to either identify batchability or encode draw commands, so the lookups are being done twice too.
  - Solution:
    - Batch directly into a draw stream
    - A draw stream is a concept from Sebastian Aaltonen's work on the HypeHype renderer. It is basically a `Vec<u32>` with a protocol. The first `u32` is a bit field for a single draw command, containing bits that indicate, for example, whether a pipeline needs to be rebound, whether a bind group needs rebinding, whether there is an index/vertex buffer to be rebound, and the type of draw (indexed vs not, direct vs indirect, etc.). The `u32`s that follow contain the ids or information needed to encode that draw.
    - The result is that lookups happen once, when batching, and `TrackedRenderPass` is no longer needed to check whether something actually needs rebinding, because the draw stream contains exactly what needs to be done.
    - Another consequence is that draw functions are no longer what is executed during draw command encoding. They should probably instead move to the batching stage or before, if necessary. This part is not yet figured out.
  - Status: proof of concept branch by @superdump. Needs more design work to figure out what needs to be done with `DrawFunction`/`RenderCommand`.
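  - An illustrative encoding of one draw into such a stream (the bit layout and fields here are invented for illustration, not HypeHype's or the branch's actual protocol):

    ```rust
    const REBIND_PIPELINE: u32 = 1 << 0;
    const REBIND_BIND_GROUP0: u32 = 1 << 1;
    const INDEXED: u32 = 1 << 2;

    /// Push one draw: a header bit field saying which state changes follow,
    /// then only the ids that are actually needed. Decoding just walks the
    /// `Vec<u32>` and issues exactly these commands, with nothing left to
    /// re-check at encode time.
    fn encode_draw(
        stream: &mut Vec<u32>,
        new_pipeline: Option<u32>,
        new_bind_group0: Option<u32>,
        index_count: u32,
        first_index: u32,
    ) {
        let mut header = INDEXED;
        if new_pipeline.is_some() { header |= REBIND_PIPELINE; }
        if new_bind_group0.is_some() { header |= REBIND_BIND_GROUP0; }
        stream.push(header);
        if let Some(id) = new_pipeline { stream.push(id); }
        if let Some(id) = new_bind_group0 { stream.push(id); }
        stream.push(index_count);
        stream.push(first_index);
    }
    ```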
- [ ] Faster `MeshUniform` serialization
  - Problem:
    - `MeshUniform` serialization is a bottleneck. `encase` performance is part of the problem.
  - Solution:
    - Use manual padding and alignment, and `bytemuck`, to bypass `encase`
  - Status: proof of concept branch by @superdump
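  - A sketch of the manual-layout approach (the field set is illustrative; the point is matching the WGSL layout rules by hand so a slice can be cast to bytes in one go):

    ```rust
    use bytemuck::{Pod, Zeroable};

    #[repr(C)]
    #[derive(Clone, Copy, Pod, Zeroable)]
    struct MeshUniformSketch {
        world_from_local: [[f32; 4]; 4], // mat4x4<f32> in WGSL
        flags: u32,
        _pad: [u32; 3], // manual padding out to a 16-byte boundary
    }

    /// No per-field serialization: just reinterpret the slice as bytes.
    fn as_bytes(uniforms: &[MeshUniformSketch]) -> &[u8] {
        bytemuck::cast_slice(uniforms)
    }
    ```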
- [ ] Write directly to `wgpu` staging buffers, avoiding memory copies
  - Problem:
    - Bevy often has multiple copies of data such as per-instance data: sometimes a Rust-side `Vec<T: ShaderType>` that is serialized into a `Vec<u8>` using `encase`, which is then given to `wgpu`'s `Queue::write_buffer()` API. This results in making multiple copies, which costs performance
  - Solution:
    - Use `wgpu`'s `write_buffer_with()` API. This allows requesting a mutable slice into an internal `wgpu` staging buffer. Serializing data directly into this mutable slice then avoids lots of unnecessary copies.
  - Status:
    - #12489 for the `GpuArrayBuffer<MeshUniform>`
    - Idea - there will be many more call sites that will need modifying to use this pattern
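  - A minimal sketch of the pattern with `wgpu`'s `write_buffer_with()`:

    ```rust
    fn upload(queue: &wgpu::Queue, buffer: &wgpu::Buffer, data: &[u8]) {
        let size = wgpu::BufferSize::new(data.len() as u64).expect("non-empty upload");
        if let Some(mut view) = queue.write_buffer_with(buffer, 0, size) {
            // `view` is a mutable slice into wgpu's staging memory. In real
            // use, serialize directly into it; copying from a ready-made
            // slice, as done here, is only for brevity.
            view.copy_from_slice(data);
        }
    }
    ```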
- [ ] Parallelize `MeshUniform` buffer preparation across `RenderPhase`s
  - Problem:
    - `batch_and_prepare_render_phase` uses `ResMut<GpuArrayBuffer<MeshUniform>>`, which means these systems run serially: there is only one `GpuArrayBuffer<MeshUniform>` resource and only one system can mutate it at a time
  - Solutions:
    - Use a `GpuArrayBuffer<MeshUniform>` per phase per view
      - This adds quite a lot of code complexity. Both @james7132 and @superdump have tried implementing it and it gets messy.
    - Get writers into mutable slices of the single `GpuArrayBuffer`, per phase per view, and then prepare each phase for each view in parallel
  - Status: #12489
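  - A sketch of the second approach: pre-split the single backing buffer into disjoint per-phase slices, then fill them in parallel (plain scoped threads stand in for Bevy's task pools):

    ```rust
    /// Carve one buffer into disjoint mutable slices, one per phase/view.
    fn split_phases<'a>(mut buffer: &'a mut [u8], sizes: &[usize]) -> Vec<&'a mut [u8]> {
        let mut slices = Vec::new();
        for &size in sizes {
            let (head, tail) = std::mem::take(&mut buffer).split_at_mut(size);
            slices.push(head);
            buffer = tail;
        }
        slices
    }

    fn main() {
        let mut buffer = vec![0u8; 192];
        std::thread::scope(|scope| {
            for slice in split_phases(&mut buffer, &[64, 128]) {
                // Stand-in for serializing one phase's MeshUniforms.
                scope.spawn(move || slice.fill(1));
            }
        });
    }
    ```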
- [ ] Material data in arrays
  - Noted in #89
  - Problem:
    - Each material instance has its own bind group, due to material data being prepared into an individual uniform buffer for that instance in `AsBindGroup`
    - Entities using the same material type and same material textures cannot be batched if their material data is different, due to needing to bind a different bind group
  - Solution:
    - Modify `AsBindGroup` to write material data into a `GpuArrayBuffer` per material type
    - Add the material index into per-instance data
    - Modify shaders to look up material data using the index from the per-instance data
  - Status: proof of concept branch by @superdump
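  - Illustrative data shapes (names are stand-ins, not Bevy's actual types): one array per material type holds the material data, and each instance carries an index into it, so differing material data no longer forces a bind group change:

    ```rust
    /// Entry in the per-material-type array (`GpuArrayBuffer`-style storage).
    #[repr(C)]
    #[derive(Clone, Copy)]
    struct GpuStandardMaterialData {
        base_color: [f32; 4],
        perceptual_roughness: f32,
        metallic: f32,
        _pad: [f32; 2],
    }

    /// Per-instance data gains the index the shader uses to fetch material
    /// data, alongside the existing transform/mesh fields.
    #[repr(C)]
    #[derive(Clone, Copy)]
    struct PerInstanceData {
        material_index: u32,
    }
    ```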
- [ ] Use one set of large mesh buffers per vertex attribute layout
  - Noted in #89
  - Problem:
    - Index/vertex buffers have to be re-bound when the mesh changes. This adds overhead when encoding draws and when drawing. It also prevents some optimisations, like being able to draw all objects for shadow mapping for a light in one draw.
  - Solution:
    - Write all mesh data for meshes with the same vertex attributes into one large index buffer and one large vertex buffer
    - Use an appropriate allocator, like a port of Sebastian Aaltonen's offset allocator, to manage allocation
    - Manage index/vertex ranges for individual assets such that encoding a draw then only binds index/vertex buffers once and just gives the index/vertex range to the draw command.
  - Status: Idea
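  - Illustrative bookkeeping for the shared buffers (names are stand-ins): each mesh records the ranges it occupies, so draws bind the shared buffers once and pass only ranges per draw:

    ```rust
    use std::ops::Range;

    // Where one mesh asset lives inside the shared buffers for its vertex
    // attribute layout; managed by the allocator.
    struct MeshAllocation {
        vertex_range: Range<u32>, // within the shared vertex buffer
        index_range: Range<u32>,  // within the shared index buffer
    }

    // A draw then needs no per-mesh rebinding, just ranges, e.g. in wgpu:
    // pass.draw_indexed(alloc.index_range.clone(), base_vertex, instances);
    ```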
- [ ] Split large buffers of vertex attributes into one for position, one for everything else
  - Problem:
    - Cache miss rate in shadow passes is high due to most vertex attributes being completely irrelevant
  - Solution:
    - Split out position from other vertex attributes and put it into its own vertex buffer
    - This enables shadow passes for opaque entities to load only the position vertex buffer, getting a much higher cache hit rate and much better performance
  - Status: Idea
- [ ] Support texture atlasing in `StandardMaterial`
  - Problem:
    - Entities can't be batched if they have different textures
  - Solution:
    - Use texture atlases, possibly in array textures, to reduce the number of separate texture bindings that are needed
    - Add support for UV offset/scale etc. as needed to map from vertex UVs to atlas UVs
  - Status: Idea
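  - The UV remap itself is simple: a per-material offset/scale maps the mesh's 0..1 UVs into the atlas region. A sketch, written in Rust for illustration though the remap would live in shader code:

    ```rust
    /// Map a vertex UV in 0..1 into the sub-rectangle of the atlas that
    /// holds this material's texture.
    fn to_atlas_uv(uv: [f32; 2], offset: [f32; 2], scale: [f32; 2]) -> [f32; 2] {
        [offset[0] + uv[0] * scale[0], offset[1] + uv[1] * scale[1]]
    }
    ```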
- [ ] Support bindless texture arrays
  - Noted in #89
  - Problem:
    - Entities can't be batched because they have different textures, and texture atlasing requires preprocessing of image assets, which may not be possible or ideal to do at runtime given that one wants to use compressed GPU texture formats that are very slow to compress
  - Solution:
    - Support bindless texture arrays where all textures are bound in one big array
    - Store indices into bindless texture arrays in material data
  - Status: Idea
- [ ] Small generational indices for assets
  - Problem:
    - Queuing, sorting, batching, preparation, and rendering are bottlenecks
    - `AssetId` is quite large due to having a UUID variant (16 bytes), which means slower hashing and worse cache hit rates (larger data uses more space in caches).
    - Loads, stores, hashes, and comparisons of `AssetId`s are done in all those bottleneck systems, either directly, or just because of being part of `PhaseItem`s.
  - Solutions:
    - @cart 's asset changes, possibly including `Asset`s as `Entity`s
      - `Uuid` is removed!
    - Slotmap render assets
      - `RenderAssets` could instead use a `SlotMap<PreparedAsset>` and a `HashMap<AssetId, SlotMapKey>`.
      - A `SlotMapKey` is a `u32` + `u32` generational index, so 8 bytes.
      - On extraction, the `SlotMapKey` is looked up for the `AssetId`, or the `AssetId` is extracted to a separate queue of assets to be prepared and the `SlotMapKey` is later backfilled into the extracted data type after preparation is complete.
      - After extraction and asset preparation, the render app then exclusively uses the `SlotMapKey` to look up the `PreparedAsset`, which avoids hashing entirely, is more cache friendly, and means less data to be sorted.
      - Status: https://github.com/bevyengine/bevy/pull/13013 by @superdump
    - Asset indexify
      - Make `Handle` and `AssetId` into structs containing `AssetIndex` and `Option<Uuid>`
      - Make `const Handle` into `const Uuid` and maintain a `HashMap<Uuid, Handle<T>>`: create the handles at runtime, insert them into the map, and look them up from the map using the `const Uuid`
      - Status: https://github.com/superdump/bevy/tree/asset-indexify by @superdump, which had better performance than slotmap render assets, but won't be merged, in favour of waiting on @cart 's coming asset changes
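  - A minimal sketch of the slotmap variant using the `slotmap` crate (the `u128` asset-id stand-in is illustrative):

    ```rust
    use slotmap::{DefaultKey, SlotMap};
    use std::collections::HashMap;

    struct PreparedAsset; // stand-in for a GPU-ready mesh/material

    #[derive(Default)]
    struct RenderAssetsSketch {
        assets: SlotMap<DefaultKey, PreparedAsset>,
        key_by_id: HashMap<u128, DefaultKey>, // AssetId -> small 8-byte key
    }

    impl RenderAssetsSketch {
        /// Hash the AssetId once, at preparation/backfill time...
        fn prepare(&mut self, asset_id: u128, prepared: PreparedAsset) -> DefaultKey {
            let key = self.assets.insert(prepared);
            self.key_by_id.insert(asset_id, key);
            key
        }

        /// ...so hot render-app paths use the small copyable key, no hashing.
        fn get(&self, key: DefaultKey) -> Option<&PreparedAsset> {
            self.assets.get(key)
        }
    }
    ```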
Optimizations for specific use cases
- [ ] Improved directional light shadow cascade culling
  - Problem:
    - Culling is done based on the directional light shadow cascade's frustum, which covers a much larger volume than the camera frustum slice corresponding to the cascade. This means we prepare for shadow casting onto regions outside the camera frustum
  - Solution:
    - Cull to camera frustum slice bounds in light space
  - Status: proof of concept branch by @superdump with some bugs
- [ ] Use storage buffers for animation data to enable batching
  - Problem:
    - Animated meshes cannot be batched, resulting in many individual draws despite entities being otherwise batchable
  - Solution:
    - Aside from WebGL2, storage buffers can be used.
    - Write animated mesh data into a runtime-sized array in a storage buffer, and write the index into that array into per-instance data. This enables a single binding to be used for all animated entities, so they can be drawn in one batch if they meet the other requirements for batching. This gives a big performance boost.
  - Status: #10094
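  - An illustrative CPU-side shape (names invented): all joint matrices go into one runtime-sized storage-buffer array, and each instance records where its joints start:

    ```rust
    /// CPU staging for a WGSL `var<storage> joint_matrices: array<mat4x4<f32>>`.
    #[derive(Default)]
    struct SkinUniformsSketch {
        joint_matrices: Vec<[[f32; 4]; 4]>,
    }

    impl SkinUniformsSketch {
        /// Append one entity's joints; the returned offset goes into its
        /// per-instance data so the vertex shader can index the shared array.
        fn push_entity(&mut self, joints: &[[[f32; 4]; 4]]) -> u32 {
            let offset = self.joint_matrices.len() as u32;
            self.joint_matrices.extend_from_slice(joints);
            offset
        }
    }
    ```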
- [ ] Remove the built-in sprite renderer and use 2D material meshes
  - Problem:
    - Sprite rendering is very optimized, but is very hard-coded and inflexible. No custom materials.
    - 2D material mesh rendering has not been as fast as sprite rendering.
    - Sprite rendering can only use quads, which incur significant overdraw that is a limiting factor for performance in some cases.
  - Solution:
    - Use 2D material meshes for rendering sprites
    - The same components/assets, or simplified sprite-focused main world components/assets, can be used and translated into material meshes
    - Sprite rendering can then continue to benefit from all subsequent optimizations made to the material mesh rendering infrastructure
  - Status: Idea
Other things to add:
- [ ] Shadow atlasing for improved performance
  - This also allows users to set a fixed VRAM budget for shadows. Having a fixed budget means we also need to figure out how to gracefully and automatically degrade shadows based on distance (or some other metric) to fit everything into the atlas.
  - Caching this between frames and only updating shadow views when necessary (e.g. don't update a view if it was static) can be a nice performance win.
- [ ] Reduce our register pressure in shaders
  - This is something that we'll eternally be fighting against, especially as more features are added.
  - Currently our main_opaque_3d fragment shader has very high register pressure due to how many VGPRs we're using, which means we get poor throughput. Forward shading is expected to have high register pressure, but bevy's is currently extreme.
  - See https://gpuopen.com/learn/occupancy-explained/ for an explanation of GPU occupancy
  - See https://flashypixels.wordpress.com/2018/11/10/intro-to-gpu-scalarization-part-1/ and https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/ for some articles on how to improve occupancy.
  - RGA can be used, if you dump bevy's shaders to SPIR-V, to view what's actually causing the VGPR pressure that is in turn causing the low occupancy.
  - Nvidia Nsight and AMD RGP can show the register pressure in a capture. Intel GPA and Apple's Xcode GPU debugger can probably show this too. Unsure about Android.
- [ ] FMA all the things.
  - Afaik shader compilers won't reorder floating point ops as it changes precision/output, so we have to try to write shaders so that they generate FMA.
  - Basically, make sure to write stuff as `a * b + c` instead of `c + a * b`
Improve `MeshUniform` inverse matrix calculation performance

This may be supplanted on platforms where compute shaders are present by #12773.

Faster `MeshUniform` serialization

https://github.com/teoxoy/encase/pull/65 should make `encase` as fast as `bytemuck`-based approaches.

#12773 also effectively bypasses `encase` on platforms where compute shaders are present, since `MeshInputUniform` uses `bytemuck`.