
GPU Instancing

Open cart opened this issue 3 years ago • 50 comments

The Bevy renderer should be able to instance entities that share a subset of the same properties.

cart avatar Aug 05 '20 02:08 cart

What is the dependency relationship between this issue and #179 ? It seems like instancing would be a significant part of refactoring the rendering pipeline and ideally we'd want to do the work for that issue in a way that doesn't require a significant amount of refactoring.

Perhaps we could simply merge the two and say that instancing is a requirement of that rendering pipeline.

chrisburnor avatar Dec 30 '20 18:12 chrisburnor

I don't see much of a need to couple them together. They will touch some of the same code, but in general I think they are separate features. I'm not convinced we need to block one on the other.

GPU instancing is a matter of grouping instanced entities together, writing values that are different to vertex attributes, and making a single "instanced" draw call.

Most of PBR is shader work, and anything that's more complicated than that (ex: shadow maps) won't have too much bearing on the instancing work.

cart avatar Dec 30 '20 19:12 cart

I've been thinking about this a bit. Sorry in advance for the content drop.

Entities that could be instanced together would have to have some things in common:

  1. Same mesh
  2. Same Render Pipeline
  3. Same bind groups for slots which are not instanced

For the interface I would propose to add an InstancedRenderResourcesNode<T, Q> similar to RenderResourcesNode. Like RenderResourcesNode it has a type parameter T: RenderResource, but it has another type parameter Q: WorldQuery. Only entities for which fetching Q returns the same value get instanced together. Q could for example be (Handle<Mesh>, Handle<Material>) in a normal use case. For each value of Q there must be exactly one entity with a RenderPipelines component, into which the generated bindings can be inserted.
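A rough sketch of the proposed signature (stand-in traits are used here in place of the real Bevy RenderResources and WorldQuery bounds so the snippet is self-contained; nothing in it is actual Bevy API):

```rust
use std::marker::PhantomData;

// Stand-ins for the Bevy render traits of that era, purely for illustration.
pub trait RenderResources {}
pub trait WorldQuery {}

/// Proposed node: entities whose `Q` fetch results compare equal are grouped
/// into one instanced draw call, and their `T` values are written contiguously
/// into a shared buffer (vertex attributes or a uniform buffer).
pub struct InstancedRenderResourcesNode<T: RenderResources, Q: WorldQuery> {
    _marker: PhantomData<(T, Q)>,
}
```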

Introducing this query parameter is a bit ugly, but I don't think there is a way around it. Which entities can be batched together depends on the render graph / pipeline and could theoretically be detected automatically, but only during the DRAW stage, which is of course too late. The advantage is that the query parameter makes this approach very flexible and would cover almost any use case.

The only difficulty in implementing this, I think, is managing the buffer memory space. As with the normal RenderResourcesNode it would make sense to write all the values into the same buffer. The difference here is that values for the same draw call must lie contiguously in the buffer, which gets complicated when new entities are added / removed.

I would like to try to implement this, but wanted to see if there is any feedback first.

MDeiml avatar Feb 05 '21 09:02 MDeiml

I'm a little confused about the "single entity with RenderPipelines component" part. Where would the entity come from / how would we choose which one to use?

In general I like the idea, but I think the node should probably manage its own RenderResourceBindings and produce its own draw commands instead of storing that in a random entity.

There's also the matter of using "dynamic uniform buffers" for each instance's data or "vertex attribute buffers". "Vertex attribute buffers" might perform better, but they also have more limitations for data layout. If you were to google "gpu instancing example", in general you'd find vertex buffer implementations. We'd probably want to do a quick and dirty performance comparison of the two approaches before investing in one or the other.

cart avatar Feb 06 '21 01:02 cart

I'm a little confused about the "single entity with RenderPipelines component" part. Where would the entity come from / how would we choose which one to use?

I don't really like this approach either, but I didn't see any other solution. The problem I'm trying to solve with this is that it would be wasteful to give each instanced entity its own Draw and RenderPipelines component, as is the case with entities that have a MeshBundle. But the draw commands would have to be stored somewhere, and the only way to get the PassNode to render something is to store the draw commands in a Draw component of some entity, if I understand this correctly.

The other idea with this was to allow for really lightweight entities (e.g. particles), that wouldn't even need to have a Handle<Mesh> or Handle<Material>. But this information then would have to be stored somewhere else. It compromises ease of use for flexibility. I'm not really sure what's more important here.

In general I like the idea, but I think the node should probably manage its own RenderResourceBindings and produce its own draw commands instead of storing that in a random entity.

This would generally be nicer, but as I said, would mean changing the PassNode and storing the draw commands somewhere else than in the World.

Also it would probably mean that there could only be one InstancedRenderResourcesNode at once, which I notice now is probably also a limitation of the general approach.

There's also the matter of using "dynamic uniform buffers" for each instance's data or "vertex attribute buffers". "Vertex attribute buffers" might perform better, but they also have more limitations for data layout. If you were to google "gpu instancing example", in general you'd find vertex buffer implementations. We'd probably want to do a quick and dirty performance comparison of the two approaches before investing in one or the other.

I will try and test that. There are also "storage buffers", but I guess they are strictly worse than uniform buffers for this.

MDeiml avatar Feb 06 '21 15:02 MDeiml

Uniform buffers seem to have a (minimum) size limit of 16384 bytes (https://docs.rs/wgpu/0.7.0/wgpu/struct.Limits.html#structfield.max_uniform_buffer_binding_size), meaning about 256 instances if each instance needs a mat4 (64 bytes). After some google searches I would guess that up to that limit they are faster than vertex buffers, but then vertex buffers take over.

MDeiml avatar Feb 06 '21 20:02 MDeiml

I don't use Bevy so I don't really have much at stake here, but I think this issue is a really important one to solve. Currently the 3D spawner example runs at about 1 frame per 3 seconds for me on my little laptop iGPU, with only about 15% gpu usage at maximum (according to intel_gpu_top). The rest of the time is probably spent in the driver copying buffers and recording draw calls.

expenses avatar Apr 08 '21 09:04 expenses

I just want to leave this link here, as it is related to the topic at hand, and it states that using the Uniform Buffers like that has performance implications: https://sotrh.github.io/learn-wgpu/beginner/tutorial7-instancing/

The relevant API is wgpu's draw_indexed function. Not sure how that helps, as I am not clear on Bevy's rendering system quite yet.

Skareeg avatar May 07 '21 03:05 Skareeg

Instancing is a very complex subject. Babylon supports 3 different types of instancing:

  • Normal instances which are found in basically every game engine.
  • Thin instances are faster than normal instances, but they are not frustum culled (so either all the instances are drawn or none of them are drawn).
  • Particle instances which render a mesh on each particle. It supports multiple different meshes for a single particle system.

In addition, there are other useful types of instancing, such as GPU skeleton animation instancing. This allows you to have multiple instanced objects (with a skeleton) and each object is playing a different animation:

http://developer.download.nvidia.com/SDK/10/direct3d/Source/SkinnedInstancing/doc/SkinnedInstancingWhitePaper.pdf

https://forum.babylonjs.com/t/vertex-animation-textures/6325

https://forum.babylonjs.com/t/animations-and-performance-tips-and-tricks/20107/4

https://www.html5gamedevs.com/topic/32313-instancedmesh-with-separate-skeleton/?tab=comments#comment-185468

Of course it will take time to implement all of that, so it should be a long term goal, but it should be kept in mind when designing the system.

Pauan avatar May 27 '21 14:05 Pauan

The first thing to settle should probably be how instancing should look from the user / ecs perspective. There are a few components we need to worry about:

  • Draw for storing draw calls (also Visible but I'm going to ignore it for now)
  • Resources that are instanced (e.g. Transform). These should be read by a InstancedRenderResourcesNode of some kind and stored in a uniform or vertex buffer
  • Resources that are not instanced (e.g. Material, Mesh). These have to be the same for every instance in the same draw call
  • Some kind of marker specifying which entities to instance together. Let's call it Instanced

Now as has come up here there are at least two different use cases that could be supported:

  1. Normal "automatic" instancing for situations where we need good performance, but not the best performance. Entities should probably look the same as normal entities with some added marker. So the typical entity would have components like Transform, Material, Mesh, Draw, Instanced, ... (here Instanced would maybe not have to contain any additional information, as entities could be automatically grouped by their material and mesh)
  2. "Manual" instancing, where performance is very critical for example in particle systems or situations where a very large number of instances is needed. For this entities should be kept very slim and not have any Draw component or have separate copies of the not instanced resources (Material, Mesh, ...). Instead this data should be stored in some "parent" entity. Here a typical entity could for example only have the components Instanced and Transform. Instanced here should contain some instance_id or reference to the parent. All the other information would be stored in the "parent" entity which would have the components Material, Mesh, Draw and some reference to the instance_id

I think it should be possible to cover both use cases with one implementation (a rough sketch of both entity shapes is below), though use case 1 is definitely more important in the short term. I'd be happy to hear what others think about this.
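A hedged sketch of how the two entity shapes could look, using roughly current Bevy ECS syntax (an era where asset handles are still components is assumed); all component names are illustrative, not a proposed API:

```rust
use bevy::prelude::*;

// Use case 1: a normal-looking entity plus an opt-in marker; grouping is
// inferred from the shared mesh and material handles.
#[derive(Component)]
struct Instanced;

#[derive(Bundle)]
struct AutoInstancedBundle {
    transform: Transform,
    mesh: Handle<Mesh>,
    material: Handle<StandardMaterial>,
    instanced: Instanced,
}

// Use case 2: a slim entity that only carries per-instance data and a
// reference to a "parent" entity owning the Mesh, Material, and Draw state.
#[derive(Component)]
struct InstanceOf {
    parent: Entity,
    instance_id: u32,
}

#[derive(Bundle)]
struct ManualInstanceBundle {
    transform: Transform,
    instance: InstanceOf,
}
```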

MDeiml avatar Oct 04 '21 14:10 MDeiml

In the 3D coding examples, we just add default plugins, and add some 3D geometry. Done. The crafty 3D programmer wants to plan draw calls, saving time by using instanced rendering.

So, perhaps we need an "Instanced" Component that can be implemented for a piece of 3D geometry? This Component needs a model, a shader, and one or more vertex buffers describing instance positions, colors or whatever else is referenced in the shader(s).

If the renderer gives us access to frustum culling, we can easily add thin instances too.

Particle instancing shouldn't be hard. We just need a way to group the 3D models and refer to them as a group.

I will start by reading through the renderer code and seeing how we're setting up draw calls.

jpryne avatar Jan 27 '22 00:01 jpryne

A bit of a fly-by comment, but I just wanted to bring up the possibility of supporting instanced stereo rendering for VR/AR, which is likely to interact with any general abstraction for supporting instances.

The main thing that I think is relevant here is that any abstraction Bevy introduces for rendering instances should internally reserve the ability to submit 2x the number of instances requested, so that shaders can use modulo arithmetic on the instance id to look up per-eye state, such as the per-eye projection matrix.
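To make the arithmetic concrete, here is a trivial sketch (plain Rust standing in for the shader code, and the even/odd eye assignment is just an assumption):

```rust
/// With single-pass stereo via instancing, the engine submits 2 * n instances
/// and the shader derives which eye and which logical instance a given
/// hardware instance index refers to.
fn decode_stereo_instance(gpu_instance_index: u32) -> (u32, u32) {
    let eye = gpu_instance_index % 2;              // 0 = left eye, 1 = right eye
    let logical_instance = gpu_instance_index / 2; // index into per-instance data
    (eye, logical_instance)
}
```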

I'm not sure if Bevy already has any kind of macro system for shaders, but to be able to write portable shaders that work with or without instanced stereo rendering, it would be good to have a standard Bevy macro for accessing the 'instance id' (which might be divided by two to hide the implementation details of the stereo rendering, so as not to break how shaders handle their per-instance data).

rib avatar Mar 02 '22 17:03 rib

@rib I just want to say that the ideal way to do stereo rendering for VR would be to use an api extension such as VK_KHR_multiview for Vulkan: https://www.saschawillems.de/blog/2018/06/08/multiview-rendering-in-vulkan-using-vk_khr_multiview/

Anything else sounds like it would get messy, fast.

expenses avatar Mar 03 '22 00:03 expenses

Thanks @expenses yeah good to bring up. As it happens I'm reasonably familiar with these extensions since I used to work on GPU drivers at Intel and even implemented the OpenGL OVR_multiview equivalent of this extension.

I suppose I tend to think the extensions themselves aren't that compelling, since what they do can generally be done without the extension (they don't depend on special hardware features). I also recall conversations within Khronos (years ago at this point, so things might have changed) where at least one other major vendor was intentionally avoiding implementing these extensions due to them being somewhat redundant (basically a utility over what apps can do themselves), so I wouldn't be surprised if the extensions aren't available across all platforms.

I haven't really kept up with what different vendors support now though, so maybe the extensions really are available across the board now. A quick search seems to suggest the vulkan extension is pretty widely supported now (https://vulkan.gpuinfo.org/listdevicescoverage.php?extension=VK_KHR_multiview) but the story is still maybe a bit more ugly for GL.

I think it's perhaps still worth keeping in mind being able to support single pass stereo rendering through instancing without necessarily always having a multiview extension available. Being able to 2x the requested instance count for draw commands could potentially be pretty simple to support so I wouldn't necessarily assume it would be that messy to support if it's considered when thinking about how to support gpu instancing in bevy. Some of the requirements like framebuffer config details or viewid shader modifications are going to be pretty similar with or without any multiview extensions I'd guess - it's mainly the bit about doubling the instance counts that would be unique to a non-multiview path.

For reference, in unity they seem to support this ok, and the main caveat is with indirect drawing commands where the engine can't practically intercept the requested instance counts and so they document for those special cases that you have to 2x the instance count yourself.

rib avatar Mar 05 '22 04:03 rib

@superdump, my impression is that this is now supported. Is that correct?

alice-i-cecile avatar Apr 26 '22 23:04 alice-i-cecile

It is supported via our low / mid-level renderer apis, by nature of wgpu supporting it (and we have an example illustrating this). I don't think we should close this because what we really need is high level / automatic support for things like meshes drawn with StandardMaterial.

cart avatar Apr 26 '22 23:04 cart

Yeah. My gut is leading me to sorting out data preparation stuff first. So things around compressed textures and ways of managing data with uniform/storage/texture/vertex/instance buffers, including using textures for non-image data and things like texture atlases and memory paging to make blobs of memory suitable for loading stuff in/out without having massive holes. Then I imagine as that enables more things to be done, bindings (bind group layouts and bind groups) would become the focus to be able to batch things together. At least for me, fiddling with these things will increase my understanding and lead me toward figuring out how to do batching in a good and automatic way. This thread is useful for understanding use cases too, like VR stereo and the different ways that Babylon.js does instanced batching.

superdump avatar Apr 27 '22 05:04 superdump

Unity has now moved away from GPU instancing and instead relies more on the "SRP Batcher" (https://docs.unity3d.com/Manual/GPUInstancing.html, https://docs.unity3d.com/Manual/SRPBatcher.html). The SRP Batcher basically reorders draw calls to reduce render state switches. It seems that they came to the conclusion that (at least for Unity) draw call batching is more performant than instancing. Maybe Bevy should also go that route, seeing that it would also mean that shaders don't have to be set up for instancing.

From what I understand, bevy at the moment orders all draw calls by distance. For transparent objects that's necessary, but for opaque / alpha-masked objects the performance benefit of reducing overdraw by ordering draw calls should probably be smaller than that of optimizing for fewer state switches.

Now admittedly I'm not an expert in this, so maybe someone with more experience in graphics programming could give their opinion on this?

But I think this shouldn't be too hard to implement since we already have code to e.g. collect all uniforms into one buffer.

EDIT: I think I was mistaken. In Unity the SRP Batcher doesn't even reorder anything. It just avoids state switches by remembering the current state, so an implementation would probably only mean minor changes in bevy_core_pipeline. Or is there something I'm missing?

MDeiml avatar May 12 '22 15:05 MDeiml

@MDeiml Note that Bevy uses WebGPU and Vulkan, so the cost of context switching is going to be very different compared to something like OpenGL. So any decisions should be based on benchmarking real Bevy apps, to make sure that we're not optimizing for the wrong thing.

Pauan avatar May 12 '22 15:05 Pauan

Unity has now moved away from GPU instancing... It seems that they came to the conclusion that (at least for Unity) draw call batching is more performant than instancing.

This doesn't really seem to be the case. https://forum.unity.com/threads/confused-about-performance-of-srp-batching-vs-gpu-instancing.949185/

According to this thread, GPU instancing should perform better than the SRP Batcher, but it is only applicable when all the instances share the same shader and mesh.

dyc3 avatar May 18 '22 15:05 dyc3

I was meaning to leave a comment about this too...

GPU instancing is a lower-level capability supported by hardware which makes it efficient to draw the same geometry N times with constrained material / transform changes made per-instance, since there is no back and forth between the CPU and GPU for all of those instances.

It's not really a question of using instancing vs batching, they are both useful tools for different (but sometimes overlapping) problems.

If you happen to need to draw lots of the same mesh, and the materials are compatible enough that you can practically describe the differences via per-instance state, then instancing is likely what you want.

On the other hand if you have lots of smallish irregular primitives that are using compatible materials (or possibly larger primitives that you know are static and can be pre-processed) then there's a good chance it's worth manually batching them by essentially software transforming them into a single mesh, and sometimes transparently re-ordering how things are drawn for the sake of avoiding material changes. Batching can be done at varying levels of the stack with more or less knowledge about various constraints that might let it make fast assumptions e.g. for cheap early culling and brazen re-ordering that allows for more aggressive combining of geometry.

Unity's SRP batching is quite general purpose so it's probably somewhat constrained in how aggressive it can be without making a bad trade off in terms of how much energy is wasted trying to batch. On the other hand UI abstractions can often batch extremely aggressively.

Tiny quads, e.g. for a UI, could be an example of an overlap where it might not always be immediately obvious whether to instance or batch. Quads are trivial to transform on the CPU, the per-draw-call overhead (especially with OpenGL) can easily outweigh that cost, and it's potentially worth CPU-transforming and re-ordering for optimal batching compared to submitting them as instances, where you'd also have to upload per-quad transforms.

rib avatar May 18 '22 16:05 rib

Let's get back on track. I'm going to summarize the conversation so far just to make sure we are on the same page. Let me know if I missed anything and I'll update this comment so we can keep this conversation a little less cluttered.

The Conversation So Far

What is GPU Instancing?

GPU Instancing is a rendering optimization that allows users to render the same object many times in a single draw call. This avoids wasting time repeatedly sending the mesh and shader to the GPU for each instance. Each instance has parameters that change how it's rendered (e.g. position).

Current Status

We have successfully determined that GPU instancing is a worthwhile effort. We have also established that instancing is different from batching. GPU instancing is technically currently possible in Bevy, as shown in this example, but this is only possible through low level APIs. This example also requires disabling frustum culling, which doesn't seem ideal. This issue is about making GPU instancing more easily accessible to users.

In order to use instancing, the objects in question must share the same shader and mesh. Each instance is provided with instance data that is unique to that instance of the object (e.g. position, rotation, scale).
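For readers newer to the low-level side, a hedged sketch of what "a single instanced draw call" looks like at the wgpu level (the names, buffer layout, and the assumption that a pipeline is already bound are all illustrative):

```rust
fn draw_instanced<'a>(
    pass: &mut wgpu::RenderPass<'a>,
    mesh_vertices: &'a wgpu::Buffer,
    mesh_indices: &'a wgpu::Buffer,
    instance_data: &'a wgpu::Buffer, // e.g. one transform per instance
    index_count: u32,
    instance_count: u32,
) {
    // The pipeline (shaders + vertex layouts) is assumed to be set already.
    pass.set_vertex_buffer(0, mesh_vertices.slice(..));
    pass.set_vertex_buffer(1, instance_data.slice(..)); // stepped per instance
    pass.set_index_buffer(mesh_indices.slice(..), wgpu::IndexFormat::Uint32);
    // One draw call covers every instance; only the instance range varies.
    pass.draw_indexed(0..index_count, 0, 0..instance_count);
}
```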

This will take the form of 2 use cases, both of which seem reasonably feasible and should be easy enough to cover in a single implementation:

  1. Automatic instancing, where Bevy just does it automatically.
  2. Instancing with custom user defined parameters. The user has a custom shader that can take custom parameters, and each instanced entity has a component to provide the custom parameters to provide to the shader.

The VK_KHR_multiview Vulkan extension and the OVR_multiview OpenGL extension should adequately handle instanced objects for VR applications, but it is possible for these extensions to be unavailable. @rib suggested being able to submit 2x the requested instance amount as a workaround when multiview is not available.

What we need to decide

  • The user facing API. So far, it's a little unclear what the user facing API will look like.
  • High level implementation details, see this comment and this comment.

How Other People Do It

There are plenty of other engines that implement instancing. These may be useful to reference when we are designing the user facing API.

  • Unity: https://docs.unity3d.com/Manual/GPUInstancing.html These docs contain an interesting note about performance:

Meshes that have a low number of vertices can’t be processed efficiently using GPU instancing because the GPU can’t distribute the work in a way that fully uses the GPU’s resources. This processing inefficiency can have a detrimental effect on performance. The threshold at which inefficiencies begin depends on the GPU, but as a general rule, don’t use GPU instancing for meshes that have fewer than 256 vertices. If you want to render a mesh with a low number of vertices many times, best practice is to create a single buffer that contains all the mesh information and use that to draw the meshes.

  • Babylon.js: https://doc.babylonjs.com/divingDeeper/mesh/copies/instances

dyc3 avatar May 20 '22 14:05 dyc3

I think a particular detail that's worth highlighting under 'How Other People Do It', looking at Unity is that they provide a macro for accessing the instance ID in shaders that should be used to ensure the engine has the wiggle room it needs to be able to change the number of instances submitted in a way that's transparent to the application.

Ref: https://github.com/TwoTailsGames/Unity-Built-in-Shaders/blob/master/CGIncludes/UnityInstancing.cginc

I'm not familiar yet with whether Bevy has any kind of similar macro system for shaders, but something comparable could make sense.

rib avatar May 20 '22 16:05 rib

Just wanna share a usecase that I'm interested in and thinking a lot about: emulated dynamic tessellation.

I'm working on implementing some special parametric surface patches that can fluctuate between being extremely small and extremely large very quickly. When they're small I want to render them as single cells, but when they're large I want to compute lots of interpolated detail. So I made an atlas of pre-computed tessellations of unit tris/quads that blend between arbitrary levels of detail on each side. The tessellation geometry then gets instance drawn for each surface patch using uniforms representing the corners and interpolation parameters of each patch. These uniforms and the LOD levels get pre-computed in a compute pass with transform feedback.

Initial prototypes using WebGL2 with a single level of detail have shown surprisingly good performance for rendering a lot of instances of a single high LOD tessellation mesh. Dynamic LOD rendering is a little trickier and still a work in progress but the idea is similar. The plan is to combine all of the tessellation levels into a single mesh and then to make use of the WEBGL_multi_draw_instanced_base_vertex_base_instance draft extension for WebGL2. This will allow rendering multiple arbitrary sections of the instance array (uniforms for one or more patches) over multiple arbitrary sections of the mesh (various LOD tessellations) all using a single draw call. Coming up with the draw call parameters will be a little tricky. For WebGL2 this needs to happen on the CPU since there's no indirect rendering, but I have a scheme in mind to make it quick by precomputing an index of neighboring patches.

As for WebGPU, there's not multidraw support yet but this will come eventually. In the meantime WebGPU already supports instance drawing starting from an arbitrary firstInstance of instance array.

My wish now is for a glTF/PBR renderer that could draw like this but in a pluggable way. Aside from rendering parametric/generative surfaces (e.g. terrain, Bezier surfaces), this could also be used for displacement mapping.

micahscopes avatar May 20 '22 17:05 micahscopes

For the general discussion: I have been thinking about this and playing around with instancing and batching in a separate repo. I would say:

GPU instancing is specifically drawing the same mesh from a vertex buffer, optional index buffer, and instance buffer (vertex buffer that is stepped at instance rate) by passing the range of instances to draw to the instanced draw command.

But, as noted in the Unity documentation, GPU instancing is inefficient for drawing many instances of a mesh with few vertices: GPUs schedule work across 32/64/etc. threads at a time, and if they can't fill those threads (due to there only being 4 vertices to process for a quad, for example), the rest of the threads in a 'warp' or 'wavefront' are left idle. This is called having low occupancy and leaves performance on the table.

As such, I think it is very important to consider other ways of instancing and also consider batching. So I should define what I understand those terms to mean.

General instancing is using the tools available to draw many instances of a mesh, not necessarily by passing the range of instances to be drawn to a draw command.

Batching is using the tools available to merge multiple draw commands into fewer draw commands. It was noted already that merging draw calls for APIs like OpenGL is much more significant a benefit than doing the same for modern APIs, but there is still benefit to be had.

Also of consideration here is that generally speaking if a data binding that is used in a bind group has to change between two things being drawn, then it requires two separate draw commands to be able to rebind that thing in between. So batching is a lot about finding ways to avoid having to rebind data bindings and instead looking up the data based on the available indices.

I’ve been fiddling and learning and thinking a lot about all of the constraints and flexibilities provided by the tools (as in the wgpu APIs as a proxy to the other graphics APIs) and various ideas have been forming.

bevy_sprites instances quads by writing all instance data as quad vertex attributes. So if you have a flat sprite colour, for example, that would be per vertex, not per instance. The downside of this is lots of duplicated data. The same goes for the vertex positions, as each of the four vertices (or maybe six if there is no index buffer? I don't remember) has to have positions and uvs. The upside is complete flexibility of those positions so that they can be absolutely transformed by a global transform.

In my bevy-vertex-pulling repository I have implemented two commonly-requested things: drawing as many quads as possible and drawing as many cubes as possible. Using vertex pulling and specially-crafted index buffers, the instances of quads or cubes can be drawn without a vertex buffer, using only per-instance data for the position and half extents. The vertex index is used to calculate both the instance index and the vertex index within the quad/cube. The cube approach is also a bit special because it only draws the three visible faces of the cube. They also output uvw coordinates and normals as necessary. At some point I would like to try using this approach for bevy_sprites, but that is already quite fast so it doesn’t feel like the highest priority, plus it depends on what transformations need to be able to be made on sprites. Translation and scale are supported and rotation could be added but also supporting shears would require a matrix per instance I guess and maybe that ends up not being worth it vs explicit vertices for quads, it would depend.
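For anyone unfamiliar with the vertex pulling trick described above, the index math is roughly the following (shown as plain Rust; in the real implementation it lives in the WGSL vertex shader, and 6 vertices per quad is an assumption for an un-indexed two-triangle quad):

```rust
const VERTICES_PER_QUAD: u32 = 6; // two triangles, no vertex or index buffer

/// Recover which quad instance and which corner a given vertex index refers
/// to, so the shader can fetch that quad's position/half-extents from a
/// storage buffer and compute the corner position itself.
fn quad_vertex(vertex_index: u32) -> (u32, u32) {
    let instance = vertex_index / VERTICES_PER_QUAD;
    let corner = vertex_index % VERTICES_PER_QUAD;
    (instance, corner)
}
```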

Drawing many arbitrary meshes with non-trivial shapes, so models with more than 256 vertices, are perhaps well-suited to using GPU instancing as they are also likely to use the same materials perhaps.

The bevy-vertex-pulling experiments are not done yet. I want to try out some more things to understand when different things help performance-wise. For example, bevy_sprites doesn't use a depth buffer, so for opaque sprites it relies on draw order to place things on top of each other. That also means the same screen fragment is shaded multiple times, which means that there are multiple sprite texture data fetches per screen fragment. Even if that doesn't practically matter on high-powered devices, it could well matter on mobile where bandwidth is much more constrained. This repeated shading of the same fragment is generally called overdraw. To avoid overdraw you can do things like occlusion culling to just not draw things that are occluded by other opaque things in front of them, or use a depth buffer which will do this as part of rasterisation, only shading fragments in front of other fragments. And then you sort opaque meshes front to back to capitalise on the early-z testing that is done as part of rasterisation in order to skip shading occluded fragments. This only applies to opaque, however. But it does raise the sorting aspect.

Batching involves lots of sorting in order to group things that can be drawn together. And sorting is also needed to capitalise on reducing overdraw to avoid repeated fragment shading costs and texture bandwidth and so on. Sorting many items can be expensive time-wise. I have done experiments with radix sorting and parallel sorting elsewhere for bevy_sprites, where we sort twice: once in the queue stage before 'pre-batching', and then again in the sort phase after mesh2d and other custom things may have been queued to render phases, which would then require splitting batches of quads. As such, bevy_sprites currently queues each sprite as an individual phase item with additional batch metadata, which means the sort phase has to sort every sprite again, and the batch phase merges the phase items into batches as much as it can, recognising that if an incompatible phase item falls within a batch, then those batch items cannot all be merged.

Now, yet another aspect is how instance data is stored. The options available are vertex/instance buffers (supported everywhere, but cannot be arbitrarily indexed from within the shader, so they only work if you actually want to draw many instances of the same mesh and the mesh has a good amount of vertices), uniform buffers (broadly supported, but only 16kB per binding is guaranteed and only fixed-size arrays are supported), storage buffers (variable-size arrays and much larger sizes, but only one array per binding, and not supported on WebGL2), and data textures (broad support, large amounts of data, but they require custom data packing/unpacking so will be unergonomic to use). For bevy-vertex-pulling I have used storage buffers as they are simple, flexible, and perform well. Long-term, they're great. But given WebGL2 support is desired, we will have to support using one of the others. Perhaps just using more but smaller batches with uniform buffers would be sufficiently good.

To me, GPU instancing is a pretty small aspect of how to handle reducing draw commands and efficiently drawing lots of things. It’s a bit too constrained. Instead I suspect other, more flexible batching methods are more generally useful.

Ultimately the end goal is to have one draw command to draw everything in view. If we look again at the data bindings used for rasterisation, we have (possibly) vertex buffer, index buffer, uniform buffer, storage buffer, texture view, and/or sampler bindings.

So far I mostly referred to putting per-instance mesh and material data into uniform/storage/data texture buffers, but if you have separate vertex buffers for your meshes, you will still have to rebind per mesh. You can merge all your mesh data into one big vertex buffer by handling generic non-trivial mesh instances as meshlets - break them up into groups of primitives such as (I saw this suggested somewhere) 45 vertices to represent 15 triangles in a triangle list. And if the mesh doesn’t fill that many, then you pad with degenerate triangles. Each meshlet has corresponding meshlet instance data. This way you can pack all vertex data into one big buffer and never have to rebind it.
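One way to read the meshlet idea above, sketched under the stated assumptions (15 triangles per meshlet in a triangle list, so 45 index entries, with the padding producing zero-area triangles the rasterizer discards):

```rust
const INDICES_PER_MESHLET: usize = 45; // 15 triangles in a triangle list

/// Split an index buffer into fixed-size meshlets, padding the last one with
/// degenerate triangles so every meshlet occupies the same range length.
fn build_meshlets(indices: &[u32]) -> Vec<[u32; INDICES_PER_MESHLET]> {
    indices
        .chunks(INDICES_PER_MESHLET)
        .map(|chunk| {
            // Repeating the last index makes the padded triangles degenerate.
            let pad = *chunk.last().unwrap_or(&0);
            let mut meshlet = [pad; INDICES_PER_MESHLET];
            meshlet[..chunk.len()].copy_from_slice(chunk);
            meshlet
        })
        .collect()
}
```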

That leaves texture bindings. Unfortunately bindless texture arrays are still fairly new and not incredibly broadly supported outside of desktop. But with those, we can have arrays of arbitrary textures, bind the array, and store indices into textures in material data. And then we’re almost done. Otherwise, we could enforce that all our textures have the same x,y size and put them into an array texture and store the layer index in material data. Or use a texture atlas either with virtual texturing at runtime which would add a lot of complexity I expect, or offline as part of the asset pipeline. Those options are in increasing order of breadth of support, though 2d array textures are practically supported everywhere it seems, and I guess decreasing in ergonomics / simplicity.

One more stick in the mud is transparency. Currently we have an order-dependent transparency method which requires sorting meshes back to front for correct blending using the over operator. If we had an order-independent method such as weighted-blended order-independent transparency, then we wouldn’t have to sort the transparent phase.

My understanding is that then once we have a fully bindless setup, we can move to GPU-driven draw command generation by using indirect draw commands as we can write all the indices for materials and meshes and such into storage buffers in compute shaders, as well as the draw commands themselves. This provides an enormous performance boost where supported. With WebGPU we should have compute and at least some indirect draw support (single indirect, but no bindless texture arrays yet, I think) but for native desktop it would probably be the practical default basically everywhere?

As the ultimate goal is performance, I think we need to consider the journey that we are on, what the parameters, flexibilities, and constraints are, and then figure out what steps to take. I think this is necessary because I think we can put together a flexible and useful solution that supports different approaches depending on platform support and user needs. I’m getting there in my learnings as you can see from the above but I’m not quite there yet. My primary next steps are to experiment with the impact of using a depth buffer on overdraw for simple millions of quads with low fragment shader cost (pure data and trivial maths) both with and without sorting front to back for opaque, then with texture fetches (so like sprites), and then try out the single vertex buffer approach.

@micahscopes could you share your code?

superdump avatar May 21 '22 04:05 superdump

It was noted on Discord that UE and Godot gain a lot from GPU instancing, particularly for foliage and ‘kit bashing’ for rocks and things like that where you can translate, scale, and rotate to reuse meshes as creating the meshes in the first place has a high cost.

I have a PR open that uses an instance buffer for mesh instance data (model and inverse model, and mesh flags). It shouldn’t take too much on top of that to support GPU instancing where entities using the same mesh have their mesh data next to each other in the instance buffer so that all instances can be drawn at once. I wanted to make a demo/test with a forest with a few tree meshes procedurally placed many times. That would probably be a good test scene.

superdump avatar May 21 '22 14:05 superdump

Cool write up of your learning / thoughts so far @superdump.

bevy_sprites instances quads by writing all instance data as quad vertex attributes. So if you have a flat sprite colour for example, that would be per vertex, not per instance.

A tiny little aside, but this reminded me to bring this up since I'd wanted to raise this before. GPUs and graphics drivers can also handle constant attributes, which I think would be really quite important / beneficial for Bevy to expose at the Mesh level. It's sort of related to this topic, since it's part of the drawing abstraction, and the way a constant attribute is defined is conceptually in terms of the per-instance read progression for the attribute.

Last time I briefly looked at this I think there might be a minor abstraction issue in wgpu where it needs to be able to pass a step mode of none on Metal for constant attributes, but in principle if it's not already supported by wgpu then it should be a fairly trivial patch to expose uniformly. On Vulkan you just specify a per-instance stride of zero I believe.

rib avatar May 21 '22 15:05 rib

@superdump It was noted on Discord that UE and Godot gain a lot from GPU instancing, particularly for foliage and ‘kit bashing’ for rocks and things like that where you can translate, scale, and rotate to reuse meshes as creating the meshes in the first place has a high cost.

Yes, it is also great for workflows like this:

https://www.youtube.com/watch?v=-zGoFQKC9lQ

https://www.youtube.com/watch?v=TMb9X1Q2YzQ

Instead of creating a single mesh, you create a small number of primitive parts (beams, floors, walls, etc.) and then combine those primitives together to create a house.

This has some huge benefits: you get much better texture resolution (since each part has its own UVs), you can reuse the same material across hundreds of objects (for better batching), and it can even have fewer vertices than a single mesh.

Of course because you're reusing the same primitives over and over and over again, instancing really helps with performance.

Pauan avatar May 21 '22 16:05 Pauan

@micahscopes could you share your code?

Unfortunately my laptop got away from me recently (:hot_face:) so I lost the branch with much of that work on it ... I'm in planning stages for a rewrite now. You can see a little video of the lost prototype here showing interactive frame rates with 8 million + triangles on an integrated laptop GPU, with one draw call to compute the uniforms and one draw call to render. I thought that was quite good. I seem to remember that after ~15 million triangles things slowed down a lot.

One thing that did survive was this modification of PicoGL.js to use the WEBGL_multi_draw_instanced_base_vertex_base_instance draft extension, along with an example, which could be good starting point for exploring what's possible with instancing and multidraw instancing. Just note that on some systems instance offsetting could be emulated, meaning that under the hood it'd actually be doing multiple rebindings to the instance buffer and multiple draw calls.

micahscopes avatar May 29 '22 14:05 micahscopes

@micahscopes - the video you linked is gone unfortunately. I'm going to read back through the beginning of this thread to see what has been discussed but what had you implemented on your branch?

superdump avatar Jun 24 '22 10:06 superdump

I still haven't read all of this thread, as far as I can remember, but I'm having ideas and I may even have written the ideas before, such is my memory these days. But anyway... the ideas.

I think an approach to figuring out how all of this instancing and batching and so on should look is to consider the different types of data that need to be accessed and how/why they cause a need to re-bind and so cause a need for separate draw commands.

As such, we have the following significant buffer types, off the top of my head:

  • index
    • range specified when doing an indexed draw command
  • vertex
    • tightly packed, but require 'parsing' (as in bit masking and shifting) / reconstruction in the shader if the desired types are not supported
    • range specified when doing a plain draw command
    • vertex rate
    • instance rate
      • range specified when doing either a plain or indexed draw command
  • uniform
    • 16kB max for compatibility, practically NVIDIA has a max of 64kB and others more
    • only fixed-size arrays, but variable-sized arrays can be emulated with a trailing fixed-size array in the struct type and encoding the number of valid elements in that array, then using dynamic offset binding (which in turn requires a separate draw command) and ensuring the buffer is as large as the last dynamic offset + the full uniform struct type std140 size as if the full array were populated.
    • dynamic offset alignment requirements can incur significant waste (such as for MeshUniform, which is 132 bytes, while dynamic offset uniform buffers often require 256-byte alignment, so 124 bytes are wasted per mesh! See the sketch after this list.)
    • internal alignment/padding requirements incur wastage too, or require care to better-pack members
  • storage
    • 128MB max for compatibility I think, practically much larger
    • dynamic offset alignment requirements incur some waste, but if I recall correctly, the most significant requirement was about 16 bytes, which is not tooooo bad.
    • internal alignment/padding requirements
  • texture
    • can use as a data texture with textureLoad, or with a sampler. a data texture would also require parsing/reconstruction of the really-stored data in the shader.
    • 1d/2d/3d/cube
    • many formats suitable for many purposes
    • maybe trashes the texture cache if many incoherent fetches are needed?
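As a concrete illustration of the alignment waste mentioned in the uniform-buffer bullet above (the 132-byte size and 256-byte alignment are the figures from that bullet; the helper name is made up):

```rust
/// Round an element size up to the next multiple of the required dynamic
/// offset alignment; every element in a dynamic-offset uniform buffer must
/// start at such an aligned offset.
fn padded_stride(element_size: u64, min_alignment: u64) -> u64 {
    ((element_size + min_alignment - 1) / min_alignment) * min_alignment
}

fn main() {
    let mesh_uniform_size = 132u64; // bytes, per the figure above
    let stride = padded_stride(mesh_uniform_size, 256);
    // Prints "stride = 256, wasted = 124"
    println!("stride = {stride}, wasted = {}", stride - mesh_uniform_size);
}
```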

We need to deal with mesh and material data and want to try to avoid rebinding, but let's take steps starting with the simplest case: GPU instancing. The goal here is that if the same mesh and material are used, the draw commands for these mesh instances can be merged. If we start from the current architecture where one mesh means one draw command:

  • if the same mesh and material are used, the draw commands for these mesh instances can be merged
    • the simplest would be to just always do this detection, but that adds overhead and could cause a performance degradation for cases where no instancing can be done
    • if it does incur a performance degradation, I have seen in other game engines that it is often an opt-in thing. We could make it so that if an entity has an Instanced component, we try to group it with other entities with the same, identifying the entities that have the same pair of mesh and material
  • the per-mesh (i.e. per-instance) data (i.e. MeshUniform data currently) needs to be written in sequence to one buffer for the batch of instances
    • this could be an instance-rate vertex buffer which would automatically advance, with the instance range provided when issuing the draw command (see the sketch after this list)
      • if we always use an instance-rate vertex buffer for per-mesh data, non-instanced meshes can just be individual draw commands that bind a mesh and draw one instance (but the correct instance index for the relevant per-mesh data)
        • I have this implemented here: #4319
        • the downside of this is that types need to be reconstructed from the data in the shader e.g. four vec4s -> mat4
        • there is no flexibility in the indexing other than providing the instance range to the draw command - the instance data has to be sequential else we incur multiple draw commands as one range can be specified per draw command
    • this could also be 'pulled' from a uniform/storage/data texture buffer, still using the instance index to index into such a buffer
      • I have some examples of this kind of pulling here: https://github.com/superdump/bevy-vertex-pulling/tree/main/examples/quads
      • a downside is still the internal padding/alignment for uniform/storage buffers; the limited size of uniform buffers (though this could just be a limit on the batch size); storage buffers not being supported in WebGL2; and data textures requiring reconstruction of data in the shader
      • there is full flexibility over random, non-contiguous indexing
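A hedged wgpu sketch of the instance-rate vertex buffer option from the list above, carrying a model matrix as four vec4 attributes that the shader reassembles into a mat4 (shader locations 3..=6 are an arbitrary choice):

```rust
fn instance_rate_layout() -> wgpu::VertexBufferLayout<'static> {
    const ATTRIBUTES: [wgpu::VertexAttribute; 4] = wgpu::vertex_attr_array![
        3 => Float32x4, 4 => Float32x4, 5 => Float32x4, 6 => Float32x4
    ];
    wgpu::VertexBufferLayout {
        // One mat4 (sixteen f32s) of per-instance data.
        array_stride: std::mem::size_of::<[f32; 16]>() as wgpu::BufferAddress,
        // The buffer advances once per instance rather than once per vertex.
        step_mode: wgpu::VertexStepMode::Instance,
        attributes: &ATTRIBUTES,
    }
}
```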

From here, if we had many entities using the same mesh but different materials that share the same textures, we could merge draw commands by storing a per-mesh (i.e. per-instance) index into a material buffer. We can merge more draws that use different texture data if virtual texturing or a texture atlas is used. And, if we then continue to implement bindless texture arrays, we can also merge draw commands for different materials, different textures, by storing texture indices in the material data.

Or, if we have the same material but different meshes and not multiple instances of the same mesh, and the vertex indices are contiguous in the index buffer, we can merge draw commands to draw the entire index range.

If we want to get all the way to just using one or very few draw commands to draw basically all the things, we need:

  • bindless texture arrays / virtual texturing / texture atlas / array textures for material textures
  • texture indices/offsets or whatever stored in material data
  • material data (i.e. standard material uniform data right now) stored in an indexable way
  • material data index stored in instance data
  • instance data (i.e. mesh uniform data right now) stored in an indexable way
  • a flexible and consistent way of mapping from arbitrary index ranges to be drawn to instance data
    • this is where storing indices in batches in the index buffer to be able to map from index to batch (how does this work... maybe some hack with the indexing to be able to do index / batch size = batch?), padding with dummy values up to the batch size
    • if the vertex index were also pulled, then the index passed to the draw command could be directly mapped to a batch as they would be sequential so index / batch size = batch. this would prevent post-transform caching where multiple indices from the index buffer within a batch are the same

Well, that brain dump feels like some more concrete progress to me...

superdump avatar Jun 24 '22 12:06 superdump

Well, that brain dump feels like some more concrete progress to me...

The brain dump was super helpful for me to read through! There are so many combinations of ways to use all the various tools available, it's really nice to have everything laid out like that.

@micahscopes - the video you linked is gone unfortunately. I'm going to read back through the beginning of this thread to see what has been discussed but what had you implemented on your branch?

Well the good news is that I've made progress on the redo and at this point am further along than the lost branch! Examples:

Just to be clear, my specific goal is to draw parametric surfaces with quickly changing levels of detail in the browser. This is one possible use-case for instancing but definitely not the only one.

The basic gist is this:

  1. An atlas of tiling tessellation geometries is pre-generated on startup and added to a single vertex buffer (using some Poisson sampling and Delaunay triangulation libraries)
  2. A "control mesh" is added to an instance attribute buffer, representing the corner points of patches to be drawn at varying levels of detail
  3. levels of detail are set for each patch of the control mesh so the shared edges of spatially adjacent patches will have aligned vertices (passed as instance attributes). In the demos this happens on the CPU but my plan is to do most of this work on the GPU using transform feedback, and additionally to transform the control mesh vertices on the GPU as well.
  4. slices of the tessellation geometry buffer are instance drawn over each "patch" of a control mesh, using the WEBGL_multi_draw_instanced_base_vertex_base_instance extension if it's available, otherwise rebinding the instance attribute buffer at a different offset for each draw range. If multiple adjacent patches in the instance attribute buffer share the same tessellation, they'll be drawn as part of the same instance draw range.

The code is still pretty experimental and in flux but I added some brief comments to the terrain example.

I've found that for my purposes this is going to work well enough... with the right parameters it even does > 30 fps on my phone!

micahscopes avatar Jun 24 '22 20:06 micahscopes

I'm starting to get ideas about the problems and solutions.

bevy API user concerns

  • Instance data (e.g. mesh transforms)
  • Multi-instance data (e.g. material data)
  • What can be batched together
    • This too could probably be avoided with some abstraction to be able to look up what cannot be batched together...
  • What is desired to not be batched together
  • What binding types are desired to be used (maybe data textures work better than uniform buffers on some platform and so detection and an override is needed)

Solution ideas

  • A flexible data binding abstraction supporting use of the appropriate types of buffers for the usage rate of the data
    • By usage rate I mean whether the data is per instance, or shared across multiple instances
  • A way of logically batching instances that could be batched together
  • Actual batching that can be converted into batched phase items (a uniform buffer only supports up to 16kB per binding, so perhaps instance data must be split across multiple dynamic uniform buffer bindings and so multiple draw commands)
  • There should be good default choices for the data binding for compatibility/performance:
    • WebGL2 (uniform buffers, instance-rate vertex buffers, data textures?)
    • bindful (instance-rate vertex buffers, storage buffers?)
    • bindless (instance-rate vertex buffers, storage buffers?)
    • It should be possible to control which data binding method is used, as long as it is supported

Data binding abstraction

  • Buffer types: uniform, storage, data texture, instance-rate vertex buffer
  • Usage rates: per-instance, multi-instance / shared
  • Per-instance will index into multi-instance
  • Need to handle splitting across multiple bindings (e.g. a uniform buffer containing a fixed-size array that holds N pieces of instance data, needs to store M instances with M > N, so M is split across multiple dynamic offsets into the same uniform buffer) and track
  • Stretch goal / later / separate concern: generate the shader binding code (types, bindings) and inject them into the shader - this could be part of AsBindGroup

I think that's the piece to build first, then it should become clearer what additional batching-related machinery is needed on top in order to get from extracted computed visible things to queued batched phase items, I guess.
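Not a proposal, just a hedged illustration of what the binding abstraction's storage choices might look like as a type (names, variants, and fields are made up):

```rust
/// One variant per storage strategy; the surrounding machinery would pick a
/// variant based on platform support and whether the data is per-instance or
/// shared across instances.
enum InstanceDataStorage {
    /// Fixed-size array in a (dynamic-offset) uniform buffer; only ~16 kB per
    /// binding is guaranteed, so large batches split across offsets and draws.
    Uniform { max_elements: u32 },
    /// Variable-size array, large and simple, but unavailable on WebGL2.
    Storage,
    /// Data packed into a texture and read with textureLoad; broad support,
    /// but needs manual packing/unpacking in the shader.
    DataTexture,
    /// Vertex buffer stepped at instance rate; only indexable via the
    /// instance range passed to the draw command.
    InstanceRateVertex,
}
```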

superdump avatar Jul 18 '22 19:07 superdump

I've been working on an instanced renderer for a project of mine that should be useful here; the code isn't hosted yet, but I should be able to split it out into its own repo for sharing soon.

My intent is to build an automated indirect-instanced equivalent to the existing Mesh / Material pipeline, whereby instances are represented as compositions of components that are extracted and auto-batched by the render machinery behind the scenes. It doesn't go all the way to the 'golden path' solution @superdump is refining, but solves some of the intermediary problems using bevy's existing abstractions. It's based on 0.7 for now, but I'd be happy to bring it up to speed with git master if it proves to be contribution material.

To give a broad overview:

Instancing-specific materials are represented by a new SpecializedInstancedMaterial trait, which is identical to the existing SpecializedMaterial trait save for a new associated Instance type that generalizes over the extraction and preparation of components into an instance data buffer.
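A guess at the rough shape being described, not the actual project code (trait, type, and method names here are invented):

```rust
use bevy::prelude::*;

/// Per-entity instance data: extracted as a component and then packed into a
/// GPU-side representation that is written to the instance buffer.
trait ExtractedInstance: Component {
    type Gpu: Copy + Send + Sync + 'static;
    fn to_gpu(&self, transform: &GlobalTransform) -> Self::Gpu;
}

/// The material trait plus an associated type describing how each entity's
/// components become per-instance buffer data.
trait SpecializedInstancedMaterial: Material {
    type Instance: ExtractedInstance;
}
```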

Instances live in a storage buffer at present, though generalizing up to uniform buffers via sub-batching into extra draw calls is on my to-do list. I experimented with using a vertex buffer to store per-instance data, but ran into the already-discussed flexibility issues - it becomes more difficult to allow for specialization over instance data when you have to worry about packing and unpacking anything larger than a vec4, though I suspect it could be done with enough pipeline and shader support machinery.

Meshes, Materials and Instances are all batched together using key types during the prepare phase, where the key for a batch of instances is a composite of its mesh and material batch keys.

Meshes are batched together into pairs of vertex / index buffers by their primitive, vertex format and index format. This allows vectors of vertex byte data to be appended directly, and index data to be composed with an offset for each source mesh.

Materials are batched by a combination of their alpha mode (using the existing three-pass setup) and their associated Key type, allowing the trait implementor to control batching via the existing pipeline caching machinery. I'm mulling over the idea of splitting this out into distinct PipelineKey and BatchKey types, but am not yet sure whether that's appropriate given that many existing per-material parameters would now live in per-instance data.
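A hedged illustration of the composite keys as described (the field choices mirror the prose above; the real implementation may differ):

```rust
use bevy::render::render_resource::{IndexFormat, PrimitiveTopology, VertexFormat};

#[derive(Clone, PartialEq, Eq, Hash)]
struct MeshBatchKey {
    topology: PrimitiveTopology,
    vertex_formats: Vec<VertexFormat>,
    index_format: Option<IndexFormat>,
}

#[derive(Clone, PartialEq, Eq, Hash)]
struct MaterialBatchKey<K> {
    /// Which of the three alpha-mode passes the material belongs to.
    alpha_pass: u8,
    /// The material's user-defined key type.
    key: K,
}

/// Instances batch together only when both mesh key and material key match.
#[derive(Clone, PartialEq, Eq, Hash)]
struct InstanceBatchKey<K> {
    mesh: MeshBatchKey,
    material: MaterialBatchKey<K>,
}
```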

Within each batch of instances, sorting by mesh is applied to allow indirect draw calls to offset based on mesh index, followed by a depth sort equivalent to the existing per-pass setup.

Thus far, I have it rendering a Mesh * Material * Instance Data permutation cube in a multi-view test scene with some simple shaders and proper batching:

[image: Mesh * Material * Instance Data permutation cube test scene]

It should be feasible to make StandardMaterial work with it by introducing the MeshUniform shadow receiver flag as a material-level batch key and binding it at each draw, though having pbr.wgsl take the flags via vertex input would make it a lot more modular for these purposes given that it doesn't use any other part of MeshUniform.

In addition, I've implemented an 'InstanceBlock' abstraction that allows an entity to reserve N instances of a given mesh / material combination for specialized instancing. These are extracted into the render world, batched alongside regular instances, and prepared with the appropriate buffer binding, offset and size for use by consuming code.
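Again just a guess at the shape based on the description (names and fields are invented, and asset handles as components are assumed):

```rust
use bevy::prelude::*;

/// Reserve `instance_count` slots of a given mesh/material batch. The material
/// handle would be typed by the instancing material in the real thing.
#[derive(Component)]
struct InstanceBlock {
    mesh: Handle<Mesh>,
    instance_count: u32,
}

/// Filled in during prepare: where the reserved slots ended up in the shared
/// instance buffer, so consuming code (e.g. a compute shader) can write there.
#[derive(Component)]
struct PreparedInstanceBlock {
    first_instance: u32,
    instance_count: u32,
}
```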

This is useful for cases where you want to avoid the overhead of batching many instances, such as GPU particles, or rendering a grid of voxels that are converted from a CPU-side data model via compute shader:

[image: voxel grid demo ("godris")]

Anyway, apologies for contributing another brain dump to an already lengthy discussion - I've been hacking on this for a few days straight, and just wanted to get it out there. Code to come soon!

Shfty avatar Jul 28 '22 07:07 Shfty

Oooooo! Cool! I'm going to have to re-read what you wrote a couple of times to understand it I think, but it certainly sounds promising!

superdump avatar Jul 28 '22 09:07 superdump

I've been working on an instanced renderer for a project of mine that should be useful here; the code isn't hosted yet, but I should be able to split it out into its own repo for sharing soon.

I look forward to seeing the code. :)

My intent is to build an automated indirect-instanced equivalent to the existing Mesh / Material pipeline, whereby instances are represented as compositions of components that are extracted and auto-batched by the render machinery behind the scenes. It doesn't go all the way to the 'golden path' solution @superdump is refining, but solves some of the intermediary problems using bevy's existing abstractions. It's based on 0.7 for now, but I'd be happy to bring it up to speed with git master if it proves to be contribution material.

My thoughts and ideas in brain dumps are iterative steps toward understanding the scope of the problem space. I am likely missing things that you know/have discovered through your implementation, so if nothing else it will surely clarify some things to see what you have done!

To give a broad overview:

Instancing-specific materials are represented by a new SpecializedInstancedMaterial trait, which is identical to the existing SpecializedMaterial trait save for a new associated Instance type that generalizes over the extraction and preparation of components into an instance data buffer.

I was with you up to "generalizes over the extraction and preparation of components into an instance data buffer". Based on what you wrote above, it sounds like this associated type defines something about the components that have to be extracted, and something about how they are prepared into instance data. Could you paste an example of this associated type? It sounds like it needs to be queries or component types to be used in queries for extraction perhaps leveraging the ExtractComponent trait? And then the corresponding preparation of each of those perhaps using some other trait implementation for each component type?

Instances live in a storage buffer at present, though generalizing up to uniform buffers via sub-batching into extra draw calls is on my to-do list. I experimented with using a vertex buffer to store per-instance data, but ran into the already-discussed flexibility issues - it becomes more difficult to allow for specialization over instance data when you have to worry about packing and unpacking anything larger than a vec4, though I suspect it could be done with enough pipeline and shader support machinery.

Starting with a storage buffer makes sense. I was intending to do the same. It allows you to focus on the rest of the owl before getting into the weeds of the data storage stuff I've been rambling about above. I just wanted to understand the different possibilities up-front so I didn't make any significant missteps in the design that would waste a lot of time.

I agree that instance buffers / data textures could work ergonomically with some traits for packing data, and shader code generation and injection that can unpack them into structured data again. Probably sub-batching over fixed arrays of T in uniform buffers would be simpler for compatibility, though it will lose a bit of performance due to more draw commands, I guess.
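
Something like the following is roughly what I have in mind - the chunk size here is an assumed limit for the sketch, sized to whatever fits in the minimum guaranteed uniform buffer range:

```rust
// Sub-batching for the uniform-buffer fallback: instances are split into fixed-size
// chunks, each chunk is uploaded as an array<T, N> uniform and drawn separately.
const MAX_INSTANCES_PER_UNIFORM: usize = 256; // assumed limit, not a Bevy constant

fn draw_sub_batches<T>(instances: &[T], mut draw: impl FnMut(&[T])) {
    for chunk in instances.chunks(MAX_INSTANCES_PER_UNIFORM) {
        // Upload `chunk` into a uniform buffer declared on the shader side as
        // array<T, MAX_INSTANCES_PER_UNIFORM>, then issue one instanced draw
        // with chunk.len() instances.
        draw(chunk);
    }
}
```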

Meshes, Materials and Instances are all batched together using key types during the prepare phase, where the key for a batch of instances is a composite of its mesh and material batch keys.

Interesting and sounds sensible. I haven't thought too much about how the actual batching would be done yet. If possible I would like to avoid the two-stage batching that happens with sprites at the moment, where sprites are sorted, a batched phase item is added to the render phase for the batch up to and including that sprite, those batched phase items are then sorted again (because transparent 2d phase items that require splitting batches could be queued elsewhere), and then batching is done again to split/merge down to the 'actual' batches. All of this is costly and I think there are probably better solutions. Communicating the information needed to identify which things can be batched to the final batching stage feels like a possibility. But then you also need to order the instance data according to the final sort order, which would mean moving data preparation from the prepare stage to later - that seems quite controversial at first. Maybe this is going too far, though: would the index buffer have to be sorted by mesh instance z too to be able to issue a sorted direct draw command? That seems a bit far out and probably unnecessary.

Meshes are batched together into pairs of vertex / index buffers by their primitive, vertex format and index format. This allows vectors of vertex byte data to be appended directly, and index data to be composed with an offset for each source mesh.

Makes sense - if meshes have different vertex attributes that get serialised into the vertex buffer then they will have a different vertex layout and so a different pipeline and cannot be batched.

Some things that you've written and some thoughts that I've had so far from reading and understanding make me feel like pipeline specialisation may need to be executed earlier. In this particular case I'm thinking that it would be good to know the vertex attributes that are actually going to be used by pipelines before creating the vertex buffer(s) rather than just serialising all attributes in the Mesh whether they are used by pipelines or not. That sounds like a later improvement to solve though.

Materials are batched by a combination of their alpha mode (using the existing three-pass setup) and their associated Key type, allowing the trait implementor to control batching via the existing pipeline caching machinery. I'm mulling over the idea of splitting this out into distinct PipelineKey and BatchKey types, but am not yet sure whether that's appropriate given that many existing per-material parameters would now live in per-instance data.

Presumably the batches are also split on different bindings, e.g. textures. Or did you implement bindless texture arrays (and company) already? A batch key could make sense, and I suppose it is a superset of the pipeline key, precisely because there can be other reasons to split a batch (data bindings, something special about the draw command like it is an instanced draw of the same Mesh).

Within each batch of instances, sorting by mesh is applied to allow indirect draw calls to offset based on mesh index, followed by a depth sort equivalent to the existing per-pass setup.

What do you mean by 'sorting by mesh'?

Thus far, I have it rendering a Mesh * Material * Instance Data permutation cube in a multi-view test scene with some simple shaders and proper batching:

permutations

Cool! I haven't looked at multi-view yet. What is that about? Why is it useful here? I thought it was maybe useful for some stereoscopic VR type stuff?

It should be feasible to make StandardMaterial work with it by introducing the MeshUniform shadow receiver flag as a material-level batch key and binding it at each draw, though having pbr.wgsl take the flags via vertex input would make it a lot more modular for these purposes given that it doesn't use any other part of MeshUniform.

The MeshUniform is the bulk of the per-instance data, isn't it? That and indices into material and index buffers? The shadow receiver flag is supposed to be per-instance, and it shouldn't need to affect batching.

In addition, I've implemented an 'InstanceBlock' abstraction that allows an entity to reserve N instances of a given mesh / material combination for specialized instancing. These are extracted into the render world, batched alongside regular instances, and prepared with the appropriate buffer binding, offset and size for use by consuming code.

This is useful for cases where you want to avoid the overhead of batching many instances, such as GPU particles, or rendering a grid of voxels that are converted from a CPU-side data model via compute shader:

godris

Sounds good! So a way of instancing 'manually' to bypass the generic instancing code when performance is more important and you know the tradeoffs (i.e. stuff may be drawn in the globally wrong order but you're choosing to sidestep that and take matters into your own hands)?

Anyway, apologies for contributing another brain dump to an already lengthy discussion - I've been hacking on this for a few days straight, and just wanted to get it out there. Code to come soon!

No apologies necessary at all, this is very useful discussion from my perspective! <3

superdump avatar Jul 28 '22 09:07 superdump

As promised, I've split the code out into its own repo which can be found here: Shfty/bevy_instancing

I've converted the regular instancing test scene into an example that can be run through cargo. There's no equivalent for InstanceBlock yet since the current test case is the main scene of the associated project, but I'll look at putting together some simple compute-driven animation to show it off and highlight the need for manual handling of transparency depth.

I was with you up to "generalizes over the extraction and preparation of components into an instance data buffer". Based on what you wrote above, it sounds like this associated type defines something about the components that have to be extracted, and something about how they are prepared into instance data. Could you paste an example of this associated type? It sounds like it needs to be queries or component types to be used in queries for extraction perhaps leveraging the ExtractComponent trait? And then the corresponding preparation of each of those perhaps using some other trait implementation for each component type?

That's correct - it's driven by a new Instance trait which is modeled as a combination of ExtractComponent and RenderAsset, along with some extra instance-specific functionality to allow for automating the process on the backend.

The most basic type of extracted instance data is MeshInstance - it contains a mesh handle and transform, and is designed to be composed into downstream Instance implementors.

The CustomMeshInstance type is the prime example of a downstream Instance at present: it adds a color parameter to the base data for use in custom.wgsl, which in turn pulls in the indirect_instancing::instance_struct WGSL import that provides the GPU-side definition of InstanceData, so consumers like CustomInstanceData can compose it.
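
As a rough sketch of the shape - simplified for this comment rather than copied verbatim from the repo:

```rust
// Hypothetical shape of the Instance trait: ExtractComponent-style extraction plus
// RenderAsset-style preparation into the GPU layout. Signatures are simplified.
use bevy::prelude::*;

pub trait Instance: Send + Sync + 'static {
    /// Render-world copy extracted from the main world each frame.
    type Extracted: Send + Sync;
    /// POD layout actually written into the instance buffer.
    type Prepared: Copy + Send + Sync;

    /// Gather the mesh handle, transform (plus any user components) for one entity.
    fn extract(mesh: Handle<Mesh>, transform: GlobalTransform /* + user components */) -> Self::Extracted;

    /// Convert the extracted data into its GPU representation.
    fn prepare(extracted: &Self::Extracted) -> Self::Prepared;
}

/// Roughly what the prepared data for CustomMeshInstance boils down to:
/// the base MeshInstance fields plus the extra color.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct CustomInstanceData {
    pub transform: [[f32; 4]; 4],
    pub color: [f32; 4],
}
```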

Interesting and sounds sensible. I haven't thought too much about how the actual batching would be done yet. If possible I would like to avoid the two-stage batching that happens with sprites at the moment where sprites are sorted, a batched phase item is added to the render phase for the batch up to and including that sprite, those batched phase items are then sorted again (because transparent 2d phase items that require splitting batches could be queued elsewhere), then batching is done again to split/merge down to the 'actual' batches. All of this is costly and I think there are probably better solutions. Communicating the information needed to identify which things can be batched to the final batching stage feels like a possibility. But then you also need to order the instance data according to the final sort order. This would mean moving data preparation from the prepare stage to later, which seems quite controversial at first. Maybe this is going too far though would the index buffer have to be sorted by mesh instance z too to be able to do a sorted direct draw command? That seems possibly a bit far out and probably/possibly unnecessary.

I came to the same conclusion re. the sprite batcher - BatchedPhaseItem initially seemed promising, but would have required creating instancing-specific equivalents to Opaque3d, AlphaMask3d, Transparent3d to encode the keys / ranges required for multi-category batching, along with all the main pass machinery necessary to drive them.

The batching itself is done in the instanced material plugin - the various key types are strewn throughout at the moment, but the gist is that each category (mesh, material, instance) has its own key type holding parameters that identify members of the category as being mutually compatible. The members themselves are collected into HashMap / BTreeMap containers by their key (taking advantage of B-tree for inline sorting), eventually culminating in the generation of BatchedInstances entities, which are then rendered by the DrawBatchedInstances render command.
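
In pseudo-Rust the grouping boils down to something like this, with placeholder key contents:

```rust
use std::collections::BTreeMap;

// Placeholder composite keys; the real ones hold whatever identifies members of
// each category as mutually compatible.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct MeshKey { primitive: u8, vertex_format: u32, index_format: u8 }

#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct MaterialKey { alpha_mode: u8, material_id: u64 }

#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct InstanceBatchKey { mesh: MeshKey, material: MaterialKey }

struct ExtractedInstance {
    key: InstanceBatchKey,
    // ... per-instance data
}

fn batch_instances(
    instances: Vec<ExtractedInstance>,
) -> BTreeMap<InstanceBatchKey, Vec<ExtractedInstance>> {
    let mut batches: BTreeMap<InstanceBatchKey, Vec<ExtractedInstance>> = BTreeMap::new();
    for instance in instances {
        batches.entry(instance.key.clone()).or_default().push(instance);
    }
    // Each entry becomes one BatchedInstances-style draw; iterating the BTreeMap
    // yields batches in key order, which is what gives the inline sorting.
    batches
}
```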

Makes sense - if meshes have different vertex attributes that get serialised into the vertex buffer then they will have a different vertex layout and so a different pipeline and cannot be batched.

Some things that you've written and some thoughts that I've had so far from reading and understanding make me feel like pipeline specialisation may need to be executed earlier. In this particular case I'm thinking that it would be good to know the vertex attributes that are actually going to be used by pipelines before creating the vertex buffer(s) rather than just serialising all attributes in the Mesh whether they are used by pipelines or not. That sounds like a later improvement to solve though.

That sounds like a more general improvement to the mesh pipeline rather than something specific to instancing, since it could reduce memory footprint, improve data throughput, etc. for the regular one-mesh-one-draw approach too.

Presumably the batches are also split on different bindings, e.g. textures. Or did you implement bindless texture arrays (and company) already? A batch key could make sense, and I suppose it is a superset of the pipeline key, precisely because there can be other reasons to split a batch (data bindings, something special about the draw command like it is an instanced draw of the same Mesh).

I've not touched on texture or buffer bindings for custom materials yet - all of the visuals above are done via custom fragment shading.

As-is, including a handle to the bound resource as part of the SpecializedInstancedMaterial::Key type would allow it to split batches (e.g. by two different textures targeting the same bind slot), but would also split cached pipelines in the same way, which isn't ideal - separating the key type into per-pipeline and per-batch should cover that for the non-bindless case.
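
Sketching that split with hypothetical names:

```rust
// Hypothetical split: PipelineKey only holds what changes the compiled pipeline,
// BatchKey additionally holds anything that forces a separate bind group / draw.
use std::hash::Hash;

pub trait InstancedMaterialKeys {
    /// Splits cached pipelines; kept minimal so pipelines are shared widely.
    type PipelineKey: Clone + Hash + Eq;
    /// Splits batches; a superset of concerns, including bound resources.
    type BatchKey: Clone + Ord;

    fn pipeline_key(&self) -> Self::PipelineKey;
    fn batch_key(&self) -> Self::BatchKey;
}

/// Example batch key: two materials with different textures get different batch
/// keys (different texture_id) while still sharing a pipeline key.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ExampleBatchKey {
    pub alpha_mode: u8,
    pub texture_id: u64, // e.g. derived from the texture handle in the non-bindless case
}
```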

What do you mean by 'sorting by mesh'?

Since meshes are batched together into shared vertex/index buffers, one batch of instances can contain more than one mesh. For a given draw call, you pick a mesh by using different offsets and counts into the vertex, index and instance buffers. So, if you want to draw all the instances with a given mesh in one call, you have to sort them by their mesh handle / index first.

It's worth noting that the draw calls are encoded via indirect buffer in this implementation, where each batch gets one indirect per mesh. The indirect data is currently recreated on each draw alongside everything else, but I want to do a change tracking pass so all the relevant buffer data is cached and only sent to the GPU when necessary.
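
For illustration, building the per-mesh indirect commands amounts to something like this once a batch's instances are sorted by mesh (the struct mirrors the standard draw_indexed_indirect argument layout; MeshEntry is the same kind of per-mesh bookkeeping as in the buffer-merging sketch above):

```rust
// One indirect command per mesh within a batch, with first_instance pointing at
// that mesh's contiguous run in the mesh-sorted instance buffer.
#[repr(C)]
#[derive(Clone, Copy)]
struct DrawIndexedIndirectArgs {
    index_count: u32,
    instance_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
}

struct MeshEntry { base_vertex: i32, first_index: u32, index_count: u32 }

fn build_indirect_commands(
    meshes: &[MeshEntry],
    sorted_instance_mesh_indices: &[u32], // one entry per instance, sorted by mesh
) -> Vec<DrawIndexedIndirectArgs> {
    let mut commands = Vec::new();
    let mut cursor = 0u32;
    let mut iter = sorted_instance_mesh_indices.iter().peekable();
    while let Some(&mesh_index) = iter.next() {
        // Count the contiguous run of instances that share this mesh.
        let mut instance_count = 1u32;
        while iter.peek() == Some(&&mesh_index) {
            iter.next();
            instance_count += 1;
        }
        let mesh = &meshes[mesh_index as usize];
        commands.push(DrawIndexedIndirectArgs {
            index_count: mesh.index_count,
            instance_count,
            first_index: mesh.first_index,
            base_vertex: mesh.base_vertex,
            first_instance: cursor, // instances are packed contiguously per mesh
        });
        cursor += instance_count;
    }
    commands
}
```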

Cool! I haven't looked at multi-view yet. What is that about? Why is it useful here? I thought it was maybe useful for some stereoscopic VR type stuff?

This isn't multi-view in the low-level API sense, just the "more than one camera" sense - i.e. making sure to respect ExtractedView and VisibleEntities so that batching occurs per-view and doesn't queue any unnecessary draws.

I don't know whether bevy already uses the low-level API to drive multiple cameras that view the same scene, but as I understand it that's essentially what it's for; like a superset of instancing with a scene -> camera relation instead of a mesh -> instance one.

The MeshUniform is the bulk of the per-instance data, isn't it? That and indices into material and index buffers? The shadow receiver flag is supposed to be per-instance, and it shouldn't need to affect batching.

Per its name, MeshUniform is tightly tied to being a single global uniform that can't change within a single draw call like instance data can. So yes, the flag should indeed be per-instance data, but the current pbr.wgsl fragment shader implementation prevents that by fetching it directly from the bound MeshUniform.

That's a non-issue if I step this up into pull request territory and fork the engine with the appropriate modifications to bevy_pbr, but would need to be done through batching in the current 0.7 implementation.

Sounds good! So a way of instancing 'manually' to bypass the generic instancing code when performance is more important and you know the tradeoffs (i.e. stuff may be drawn in the globally wrong order but you're choosing to sidestep that and take matters into your own hands)?

Yes, precisely - using the 3D tetris scene above as an example, it skips the CPU-side batching to avoid overhead in cases where many cells are filled, and directly writes the board state into the instance buffer via compute. The draw order issue manifests itself for adjacent transparent cells, but can be worked around for this specific 'ordered grid' case by culling interior cell faces.

Speaking in terms of the examples further up the page, the batched instances as I've implemented them would be 'standard instances', InstanceBlock is a more generalized version of 'particle instances' that gets auto-packed after standard instance data, and 'thin instances' would likely be a halfway point whereby the contents of a block are controlled through a variant of the standard component-based interface minus the ComputedVisibility.

Shfty avatar Jul 29 '22 01:07 Shfty

I've pushed up some new commits that add an example for InstanceBlock, implement the mentioned PipelineKey and BatchKey split, correct some batch ordering issues, and other misc tidying.

This has prompted some further thoughts on transparency ordering:

Currently, transparency ordering is only correct for instances that share the same mesh within a given batch. This is down to sorting each batch by mesh for the sake of issuing fewer draw calls, where the scaling is O(num_meshes) calls.

This could be solved by making depth the first-class sort predicate, then generating one indirect draw call for each contiguous block of same-meshed instances in the batch - a similar approach to the one taken by the 2D sprite batcher. This would give correct ordering within each batch, but scale as O(num_mesh_changes_along_z), which could be extremely bad in specific stress cases like dense CPU-particle fields with one material and many meshes.
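
For reference, that alternative boils down to something like the following sketch:

```rust
// Depth-first alternative: sort by view depth, then emit one draw per contiguous
// same-mesh run. Draw count becomes O(number of mesh changes along z) rather than
// O(number of meshes), which is why it can degenerate badly.
struct SortRecord { mesh_index: u32, depth: f32 }

fn depth_sorted_runs(mut instances: Vec<SortRecord>) -> Vec<(u32, u32)> {
    // Back-to-front for transparency; assumes no NaN depths.
    instances.sort_by(|a, b| b.depth.partial_cmp(&a.depth).unwrap());

    // Each (mesh_index, instance_count) pair would become one indirect draw.
    let mut runs: Vec<(u32, u32)> = Vec::new();
    for record in &instances {
        match runs.last_mut() {
            Some((mesh, count)) if *mesh == record.mesh_index => *count += 1,
            _ => runs.push((record.mesh_index, 1)),
        }
    }
    runs
}
```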

At which point, I start to question whether it's worth doing. It would mean per-batch correctness, but the segmented nature of batching means inter-material depth sorting is infeasible; the more you chop up batches based on changes along view depth, the less benefit you gain from instancing, to the point where you might as well use the regular pipeline in worst-case scenarios where the batcher is constrained by the composition of the scene.

So, my gut tells me the right move is to stick with the performance-first approach for now - since instancing is primarily a performance tool - document the batch ordering behaviour, and defer trying to solve for absolutely correct ordering until an order-independent transparency solution can be implemented, taking care of the whole problem space at once rather than addressing a specific subset of it with a knock-on detriment to passes that leverage the depth buffer.

Now that BatchKey has been realized, I've also been mulling over what degree of render order control could be exposed to consuming code for the sake of working around potential inter-batch sort issues. BatchKey allows type-level control over the order of batches within a given material via Ord, but the order of the materials themselves is controlled by the Ord implementation of Handle<M> where M: SpecializedInstancedMaterial, with the equivalent being true for Handle<Mesh> with mesh batching. Those are driven by an asset ID, so I suppose that could already be controlled based on the order in which the respective assets are registered with the app, though I'm not sure if bevy makes any guarantees about that.

Shfty avatar Jul 30 '22 05:07 Shfty

Following today's 0.8 release (great stuff, lots of good changes!) I took the liberty of bringing the code up-to-date with bevy's main branch.

It needs a refactor pass to bring SpecializedInstancedMaterial / InstancedMaterial in-line with the new StandardMaterial / Material patterns along with other related changes, but functionality-wise it's all intact and working.

I've added a new TextureMaterial type to test texture sampling, and have added some permutations of it to the instance example; the four textured materials batch separately, but share cached pipelines where appropriate:

screenshot

Further to that, I note that the import changes around bevy_pbr remove the barriers (that I'm aware of so far) to implementing an instanced version of StandardMaterial!

I've also started the groundwork for generalizing over storage and uniform buffers for instance data. The data types and conditionals are in place, but the uniform drawing logic still needs to be done - I'm having some trouble with my wasm-server-runner toolchain following the wgpu version bump, so will have to figure that out before progressing it further.

Finally, here's a brief clip of the compute particle example for any interested readers:

https://user-images.githubusercontent.com/1253239/182017199-96dc5c6f-d84a-427a-b650-2ccae3c426c3.mp4

Shfty avatar Jul 31 '22 08:07 Shfty