2D: Combine texture state to batch more subsequent commands together
This PR builds on #92797, to improve the 2D renderer, such that it can combine commands that change texture state, allowing for more cases to be rendered as a single draw call.
Using driver limits, the current implementation will support up to 64 textures in a single batch. Additionally, samplers have been moved back to an array for canvas rendering, so batching can dynamically address the default samplers like SAMPLER_NEAREST_CLAMP. 3 additional sampler states are supported that are not available in the default samplers, such as mirror-repeat, bringing the per batch to a maximum of 15 sampler states.
Another important change in this PR is to move the specular_shininess to PushConstant state, allowing the field to be used for #97058.
Bugfixes
- As noted in this comment, fixes #97586 for the forward+ renderer, but not compatibility
- Closes #97593 which is a forward+ issue only
- Fixes #97774
- Fixes #98005
Background
One benefactor of this improvement is text rendering, which changes texture state to render effects such as shadows. A demonstration of that was a comment by @passivstar. With this PR, we can see that the number of draw calls was reduced from 40,332 down to 4, which an associated frame-rate increase:
| master | This PR |
|---|---|
| 40 fps | 400+ fps |
| 40332 draw calls | 4 draw calls |
Cases where unique textures are used for sprites may also benefit from this PR. To demonstrate this, I modified @Calinou's test project to draw 10,000 sprites, from this comment, to alternate between two sprites using different textures.
| master | This PR |
|---|---|
| 214 fps | 469 fps |
| 10004 draw calls | 1 draw call |
[!NOTE]
Also shows that the label is combined into the same draw call, so the entire scene is rendered in a single draw call.
@clayjohn I have not converted the SAMPLERS back to an array, but if we did, we could use that for sampler state dynamically, rather than a separate array. I would extend the length from 12 to 16, so that the 2D renderer could add up to 4 unique sampler states per draw call, that aren't available in the default list. That is probably MIRROR_REPEAT, which isn't available in the default sampler states.
Also note that this PR will help with moving to batching multiple command types together.
Preliminary tests are very good. Running the scene in https://github.com/godotengine/godot-demo-projects/tree/master/2d/platformer It goes from 566 batches down to 33!
No text appears when GODOT_2D_TEX_BATCHING_ENABLED=1 is used to run Godot, so the scene appears just gray. This occurs both with Vulkan and Direct3D 12 (but not OpenGL). The text and its shadow appear normally without this variable.
This is caused by a shader error. I'm about to paste a diff that will fix it.
With your MRP I go from 17 FPS to 90 FPS :)
No text appears when GODOT_2D_TEX_BATCHING_ENABLED=1 is used to run Godot, so the scene appears just gray. This occurs both with Vulkan and Direct3D 12 (but not OpenGL). The text and its shadow appear normally without this variable.
This is caused by a shader error. I'm about to paste a diff that will fix it.
With your MRP I go from 17 FPS to 90 FPS :)
Awesome, thanks @clayjohn!
This diff does two things:
- changes the
specular_shininessproperty tobatch_indexeswhen usingUSE_BATCH_TEXTURES - Combines the instance uniform set with the canvas texture uniform set so the total number of uniform sets is 4 (that fixes https://github.com/godotengine/godot/issues/97341)
This diff works, but there should be some cleanup after to change _bind_canvas_texture() to something else and rename diffuse_or_uniform_set
@clayjohn thanks – I merged your changes. I'll remove the environment check and tidy up the code, as then I will adjust the _bind_canvas_texture function.
Latest changes no longer use the push_constant, per recommendation from @clayjohn. Thanks for the assist!
I am going to remove the editor property for the texture data buffer size. I will calculate it dynamically, as it will never be very big
NOTE
I have removed the editor property as the implementation dynamically adjusts the buffer size to settle on a good default.
Also addresses to bugs in RD batching.
Will need a rebase after #97554 - sorry that one might be a bit painful, but we wanted to fix the regression in priority for 4.4.dev3.
@akien-mga rebased and conflicts have been resolved
@AThousandShips I've renamed indexes → indices
Using @clayjohn's words, with this one simple trick, frame rate for his pathological test case jumped from 25 fps to 45 fps on my M1.
The improvement was to remove all the allocations building the RD::Uniform arrays each batch. I dropped the number of transient allocations from 2.9 million for a 3-second window:
to 8,796 over the same periond
We can see that removing the append_id gained us 40% back on the _render_batch function, with this before snapshot:
An this after:
Next I want to focus on the get_cache call, as I think we can optimise for our use case in the 2D renderer. It isn't critical for this PR, but we can see that if we focus on just the _render_batch call, 80% of the usage is charged to the get_cache call:
It has to be very generic and compare all the uniform state, including the binding, type, number of RIDs and if the RIDs are the same. We can do better for these two uniforms in the canvas renderer I believe
I just checked out this PR; I'm getting weird artifacts in the editor and in game (not reproducable in 4.4 dev 3). ~~Might be related to Wayland?~~ reproduced it on Gnome x11.
Specs:
Godot v4.4.dev (d9b1bd1af) - Pop!_OS 22.04 LTS on Wayland - X11 display driver, Single-window, 2 monitors - Vulkan (Forward+) - dedicated AMD Radeon RX 6800 XT (RADV NAVI21) - AMD Ryzen 9 5900X 12-Core Processor (24 threads)
MRP: Here is a somewhat minimal project (it was a small prototype) in which it happens for me: drone_prototype_compressed.zip
Looks like this issue only occurs if the project was started in v4.3, or 4.4 dev 3 (versions prior to this PR) and then subsequently be opened in this PR. So, if you make a new project within this PR it doesn't happen.
If you delete the .godot folder containing the shader cache, does that solve it?
If you delete the .godot folder containing the shader cache, does that solve it?
Nope deleted the .godot; doesn't seem to make any difference.
@LeeWannacott You would also need to clear all the godot shader caches and your driver shader cache to be sure it isn't a shader issue.
On PopOS the shader caches are stored at:
- ~/.local/share/godot/shader_cache (for built in shaders)
- ~/.local/share/godot/app_userdata/your_project_name/shader_cache
- ~/.local/share/godot/app_userdata/your_project_name/vulkan
I can't find where AMD stores their shader cache on PopOS, but I would try deleting the above and see if it fixes things
@clayjohn Deleted 1., 2. ,3. and .godot in the projects folder and deleted ~/.cache/mesa_shader_cache (which I think is where AMD stores their cache?)
Still got the same issue. If I open the project with 4.4 dev 3 its perfectly fine, it only happens when I open with godot built from this PR.
Update: Sorry, I was wrong. It does actually happen when making a new project with this PR, but only when selecting forward+ renderer, selecting compatibility is fine (which mislead me).
I'll try the PR out on Windows.
If it happens in a new project then the shader cache definitely isn't the problem. Thanks for helping us rule that out.
Most likely then the change to the shader has exposed a driver bug with your GPU.
Most likely then the change to the shader has exposed a driver bug with your GPU
Yep looks like it, Managed to reproduce it on Windows (same hardware).
Most likely then the change to the shader has exposed a driver bug with your GPU
Yep looks like it, Managed to reproduce it on Windows (same hardware).
That seems surprising, as the drivers would be different on Windows.
Latest push
- Reimplements debug draw support to close #98005 as demonstrated in this comment
- Includes additional optimisations to remove the uniform cache, and uses the
LRUCacheto limit maximum number of cachedBATCH_UNIFORM_SETinstances to 128
The improvement to remove the uniform cache and replace it with an LRUCache further improved the drawcalltest pathological case from 45 fps to 75 fps. The following screen shot shows a significant reduction in CPU-bound work where ❶ uses the uniform cache and ❷ uses a local cache[^1]:
Other benchmarks, that benefit from batching are largely unaffected by the change.
[^1]: Gc is giga-cycles
Thanks @Calinou – I'll look at your third test case now
@Calinou Curiously, I am seeing the opposite, that this PR improves over the official dev 4.4 build (prior to this PR). However, it does appear that #98212 is a significant improvement (within Metal), and may have had a big impact.
[!NOTE]
I tried Vulkan, and I am seeing the same performance between this PR and the dev 4.4 offical.
This PR: ~54 fps
Godot.app/Contents/MacOS/Godot --disable-vsync --screen 1 --print-fps
Output:
Godot Engine v4.4.dev.custom_build.2039402ac (2024-10-08 07:03:19 UTC) - https://godotengine.org
Metal 3.2 - Forward+ - Using Device #0: Apple - Apple M1 Max (Apple7)
Requested V-Sync mode: Disabled
Project FPS: 54 (18.51 mspf)
Project FPS: 55 (18.18 mspf)
Project FPS: 54 (18.51 mspf)
Project FPS: 55 (18.18 mspf)
v4.4.dev3.official.f4af8201b: 15 fps
Godot.app/Contents/MacOS/Godot --disable-vsync --screen 1 --print-fps
Output:
Godot Engine v4.4.dev3.official.f4af8201b - https://godotengine.org
Metal 3.1 - Forward+ - Using Device #0: Apple - Apple M1 Max (Apple7)
Requested V-Sync mode: Disabled
Project FPS: 15 (66.66 mspf)
Project FPS: 15 (66.66 mspf)
Project FPS: 16 (62.50 mspf)
Project FPS: 15 (66.66 mspf)
v4.3.stable.official.77dcf97d8: 55 fps
[!NOTE]
Using Vulkan, as no Metal in this build
Godot.app/Contents/MacOS/Godot --disable-vsync --screen 1 --print-fps
Output:
Godot Engine v4.3.stable.official.77dcf97d8 - https://godotengine.org
Vulkan 1.2.283 - Forward+ - Using Device #0: Apple - Apple M1 Max
Requested V-Sync mode: Disabled
Project FPS: 55 (18.18 mspf)
Project FPS: 56 (17.85 mspf)
Project FPS: 56 (17.85 mspf)
Project FPS: 56 (17.85 mspf)
@Calinou / @clayjohn regarding Sprite rendering (1 unique texture per sprite) test: https://github.com/godotengine/godot/pull/97340#pullrequestreview-2374632529
The problem is that the uniform set cache is being thrashed (I knew that would be a problem with the right conditions). Bindless textures would be really nice…
Some options:
- Allow the cache size to be configurable, so that a user can adjust for their specific scenario. Its not perfect, but would give the user a simple option
- Provide an option to choose to the batch mode, and include single-texture batching, which for a game with 100s or thousands of unique (or dynamic) sprites, might be a better fit
- Do both
[!NOTE]
- is easy, and I think we should do it anyway, as that gives the user a way to address cache thrashing. Cache invalidations (or perhaps uniform set creation count) would be worth adding to the monitors / visual profiler too.
Whilst investigating, I've identified an issue that if the texture is changed:
https://github.com/godotengine/godot/blob/e2c6daf7eff6e0b7e2e8d967e95a9ad56e948231/servers/rendering/renderer_rd/storage_rd/texture_storage.cpp#L581-L597
TextureStorageRD invalidate caches; however, it won't invalidate the cache in the 2D renderer.
Another issue is that if a texture is freed, we would need to invalidate the 2D renderer caches. However, it is unclear if this is taken care of by the uniform_set_invalidation_callback, when a texture is freed? We subscribe to this for invalidating the cache.
EDIT: Yes, it does appear that if a texture is freed, the associated uniform set will be freed, as it is a direct dependency. I do see invalidation callbacks being called when a texture is freed. Texture updates are still an issue.
We would need to add a notification / callback so the 2D renderer can subscribe and invalidate any uniform sets when the texture is changed.
We may still have to limit the size of the cache, as draw order changes might still generate an unbounded number of uniform sets if the set of bound sprites for a given batch does not exist. Normally, the batching attempts to put the same texture in the same array slot, to reuse the same uniform set.
I can confirm the cache is invalidated when textures are freed, so we just need to address the changed issue.
Another pathological case is that if the ordering of nodes changes regularly, and the number of textures is high, it will result in uniform set churn. Not sure if this is a realistic scenario, as any solution we introduce is likely to perform poorly is some cases.
I added a timer to @Calinou's unique test case, and a 2D node named Sprites to add the 10,000 sprites:
func _on_timer_timeout() -> void:
var node = sprites.get_child(0)
sprites.remove_child(node)
node.queue_free()
sprites.get_children().shuffle()
var sprite := $Sprite2D.duplicate()
var tex := GradientTexture2D.new()
tex.width = 1
tex.height = 1
sprite.texture = tex
sprite.position.x = randf_range(0, 1152)
sprite.position.y = randf_range(0, 648)
sprites.add_child(sprite)