This PR builds on #92797, to improve the 2D renderer, such that it can combine commands that change texture state, allowing for more cases to be rendered as a single draw call.

Using driver limits, the current implementation will support up to 64 textures in a single batch. Additionally, samplers have been moved back to an array for canvas rendering, so batching can dynamically address the default samplers like SAMPLER_NEAREST_CLAMP. 3 additional sampler states are supported that are not available in the default samplers, such as mirror-repeat, bringing the per batch to a maximum of 15 sampler states.

Another important change in this PR is to move the specular_shininess to PushConstant state, allowing the field to be used for #97058.

Bugfixes

As noted in this comment, fixes #97586 for the forward+ renderer, but not compatibility
Closes #97593 which is a forward+ issue only
Fixes #97774
Fixes #98005

Background

One benefactor of this improvement is text rendering, which changes texture state to render effects such as shadows. A demonstration of that was a comment by @passivstar. With this PR, we can see that the number of draw calls was reduced from 40,332 down to 4, which an associated frame-rate increase:

master	This PR
40 fps	400+ fps
40332 draw calls	4 draw calls

Cases where unique textures are used for sprites may also benefit from this PR. To demonstrate this, I modified @Calinou's test project to draw 10,000 sprites, from this comment, to alternate between two sprites using different textures.

master	This PR
214 fps	469 fps
10004 draw calls	1 draw call

[!NOTE]

Also shows that the label is combined into the same draw call, so the entire scene is rendered in a single draw call.

Sep 22 '24 20:09 stuartcarnie

@clayjohn I have not converted the SAMPLERS back to an array, but if we did, we could use that for sampler state dynamically, rather than a separate array. I would extend the length from 12 to 16, so that the 2D renderer could add up to 4 unique sampler states per draw call, that aren't available in the default list. That is probably MIRROR_REPEAT, which isn't available in the default sampler states.

Sep 22 '24 20:09 stuartcarnie

Also note that this PR will help with moving to batching multiple command types together.

Sep 22 '24 20:09 stuartcarnie

Preliminary tests are very good. Running the scene in https://github.com/godotengine/godot-demo-projects/tree/master/2d/platformer It goes from 566 batches down to 33!

Sep 23 '24 01:09 clayjohn

No text appears when GODOT_2D_TEX_BATCHING_ENABLED=1 is used to run Godot, so the scene appears just gray. This occurs both with Vulkan and Direct3D 12 (but not OpenGL). The text and its shadow appear normally without this variable.

This is caused by a shader error. I'm about to paste a diff that will fix it.

With your MRP I go from 17 FPS to 90 FPS :)

Sep 23 '24 01:09 clayjohn

No text appears when GODOT_2D_TEX_BATCHING_ENABLED=1 is used to run Godot, so the scene appears just gray. This occurs both with Vulkan and Direct3D 12 (but not OpenGL). The text and its shadow appear normally without this variable.

This is caused by a shader error. I'm about to paste a diff that will fix it.

With your MRP I go from 17 FPS to 90 FPS :)

Awesome, thanks @clayjohn!

Sep 23 '24 02:09 stuartcarnie

instance-changes.txt

This diff does two things:

changes the specular_shininess property to batch_indexes when using USE_BATCH_TEXTURES
Combines the instance uniform set with the canvas texture uniform set so the total number of uniform sets is 4 (that fixes https://github.com/godotengine/godot/issues/97341)

This diff works, but there should be some cleanup after to change _bind_canvas_texture() to something else and rename diffuse_or_uniform_set

Sep 23 '24 02:09 clayjohn

@clayjohn thanks – I merged your changes. I'll remove the environment check and tidy up the code, as then I will adjust the _bind_canvas_texture function.

Sep 23 '24 20:09 stuartcarnie

Latest changes no longer use the push_constant, per recommendation from @clayjohn. Thanks for the assist!

Sep 28 '24 00:09 stuartcarnie

I am going to remove the editor property for the texture data buffer size. I will calculate it dynamically, as it will never be very big

Sep 28 '24 04:09 stuartcarnie

NOTE

I have removed the editor property as the implementation dynamically adjusts the buffer size to settle on a good default.

Also addresses to bugs in RD batching.

Sep 28 '24 20:09 stuartcarnie

Will need a rebase after #97554 - sorry that one might be a bit painful, but we wanted to fix the regression in priority for 4.4.dev3.

Sep 28 '24 22:09 akien-mga

@akien-mga rebased and conflicts have been resolved

Sep 29 '24 04:09 stuartcarnie

@AThousandShips I've renamed indexes → indices

Oct 04 '24 20:10 stuartcarnie

Using @clayjohn's words, with this one simple trick, frame rate for his pathological test case jumped from 25 fps to 45 fps on my M1.

The improvement was to remove all the allocations building the RD::Uniform arrays each batch. I dropped the number of transient allocations from 2.9 million for a 3-second window:

CleanShot 2024-10-05 at 06 15 17@2x

to 8,796 over the same periond

CleanShot 2024-10-05 at 06 15 32@2x

Oct 04 '24 20:10 stuartcarnie

We can see that removing the append_id gained us 40% back on the _render_batch function, with this before snapshot:

CleanShot 2024-10-05 at 06 31 37@2x

An this after:

CleanShot 2024-10-05 at 06 31 44@2x

Next I want to focus on the get_cache call, as I think we can optimise for our use case in the 2D renderer. It isn't critical for this PR, but we can see that if we focus on just the _render_batch call, 80% of the usage is charged to the get_cache call:

CleanShot 2024-10-05 at 06 36 47@2x

It has to be very generic and compare all the uniform state, including the binding, type, number of RIDs and if the RIDs are the same. We can do better for these two uniforms in the canvas renderer I believe

Oct 04 '24 20:10 stuartcarnie

I just checked out this PR; I'm getting weird artifacts in the editor and in game (not reproducable in 4.4 dev 3). ~~Might be related to Wayland?~~ reproduced it on Gnome x11.

Specs: Godot v4.4.dev (d9b1bd1af) - Pop!_OS 22.04 LTS on Wayland - X11 display driver, Single-window, 2 monitors - Vulkan (Forward+) - dedicated AMD Radeon RX 6800 XT (RADV NAVI21) - AMD Ryzen 9 5900X 12-Core Processor (24 threads)

MRP: Here is a somewhat minimal project (it was a small prototype) in which it happens for me: drone_prototype_compressed.zip

Looks like this issue only occurs if the project was started in v4.3, or 4.4 dev 3 (versions prior to this PR) and then subsequently be opened in this PR. So, if you make a new project within this PR it doesn't happen.

Oct 07 '24 00:10 kyle-wannacott

If you delete the .godot folder containing the shader cache, does that solve it?

Oct 07 '24 03:10 huwpascoe

If you delete the .godot folder containing the shader cache, does that solve it?

Nope deleted the .godot; doesn't seem to make any difference.

Oct 07 '24 03:10 kyle-wannacott

@LeeWannacott You would also need to clear all the godot shader caches and your driver shader cache to be sure it isn't a shader issue.

On PopOS the shader caches are stored at:

~/.local/share/godot/shader_cache (for built in shaders)
~/.local/share/godot/app_userdata/your_project_name/shader_cache
~/.local/share/godot/app_userdata/your_project_name/vulkan

I can't find where AMD stores their shader cache on PopOS, but I would try deleting the above and see if it fixes things

Oct 08 '24 17:10 clayjohn

@clayjohn Deleted 1., 2. ,3. and .godot in the projects folder and deleted ~/.cache/mesa_shader_cache (which I think is where AMD stores their cache?)

Still got the same issue. If I open the project with 4.4 dev 3 its perfectly fine, it only happens when I open with godot built from this PR.

Update: Sorry, I was wrong. It does actually happen when making a new project with this PR, but only when selecting forward+ renderer, selecting compatibility is fine (which mislead me).

I'll try the PR out on Windows.

Oct 08 '24 18:10 kyle-wannacott

If it happens in a new project then the shader cache definitely isn't the problem. Thanks for helping us rule that out.

Most likely then the change to the shader has exposed a driver bug with your GPU.

Oct 08 '24 20:10 clayjohn

Most likely then the change to the shader has exposed a driver bug with your GPU

Yep looks like it, Managed to reproduce it on Windows (same hardware).

Oct 08 '24 21:10 kyle-wannacott

Most likely then the change to the shader has exposed a driver bug with your GPU

Yep looks like it, Managed to reproduce it on Windows (same hardware).

That seems surprising, as the drivers would be different on Windows.

Oct 08 '24 22:10 stuartcarnie

Latest push

Reimplements debug draw support to close #98005 as demonstrated in this comment
Includes additional optimisations to remove the uniform cache, and uses the LRUCache to limit maximum number of cached BATCH_UNIFORM_SET instances to 128

The improvement to remove the uniform cache and replace it with an LRUCache further improved the drawcalltest pathological case from 45 fps to 75 fps. The following screen shot shows a significant reduction in CPU-bound work where ❶ uses the uniform cache and ❷ uses a local cache[^1]:

CleanShot 2024-10-13 at 09 15 40@2x

Other benchmarks, that benefit from batching are largely unaffected by the change.

[^1]: Gc is giga-cycles

Oct 12 '24 22:10 stuartcarnie

Thanks @Calinou – I'll look at your third test case now

Oct 17 '24 19:10 stuartcarnie

@Calinou Curiously, I am seeing the opposite, that this PR improves over the official dev 4.4 build (prior to this PR). However, it does appear that #98212 is a significant improvement (within Metal), and may have had a big impact.

[!NOTE]

I tried Vulkan, and I am seeing the same performance between this PR and the dev 4.4 offical.

This PR: ~54 fps

Godot.app/Contents/MacOS/Godot --disable-vsync --screen 1 --print-fps

Output:

Godot Engine v4.4.dev.custom_build.2039402ac (2024-10-08 07:03:19 UTC) - https://godotengine.org
Metal 3.2 - Forward+ - Using Device #0: Apple - Apple M1 Max (Apple7)
Requested V-Sync mode: Disabled

Project FPS: 54 (18.51 mspf)
Project FPS: 55 (18.18 mspf)
Project FPS: 54 (18.51 mspf)
Project FPS: 55 (18.18 mspf)

v4.4.dev3.official.f4af8201b: 15 fps

Godot.app/Contents/MacOS/Godot --disable-vsync --screen 1 --print-fps

Output:

Godot Engine v4.4.dev3.official.f4af8201b - https://godotengine.org
Metal 3.1 - Forward+ - Using Device #0: Apple - Apple M1 Max (Apple7)
Requested V-Sync mode: Disabled

Project FPS: 15 (66.66 mspf)
Project FPS: 15 (66.66 mspf)
Project FPS: 16 (62.50 mspf)
Project FPS: 15 (66.66 mspf)

v4.3.stable.official.77dcf97d8: 55 fps

[!NOTE]

Using Vulkan, as no Metal in this build

Godot.app/Contents/MacOS/Godot --disable-vsync --screen 1 --print-fps

Output:

Godot Engine v4.3.stable.official.77dcf97d8 - https://godotengine.org
Vulkan 1.2.283 - Forward+ - Using Device #0: Apple - Apple M1 Max
Requested V-Sync mode: Disabled

Project FPS: 55 (18.18 mspf)
Project FPS: 56 (17.85 mspf)
Project FPS: 56 (17.85 mspf)
Project FPS: 56 (17.85 mspf)

Oct 17 '24 19:10 stuartcarnie

@Calinou / @clayjohn regarding Sprite rendering (1 unique texture per sprite) test: https://github.com/godotengine/godot/pull/97340#pullrequestreview-2374632529

The problem is that the uniform set cache is being thrashed (I knew that would be a problem with the right conditions). Bindless textures would be really nice…

Some options:

Allow the cache size to be configurable, so that a user can adjust for their specific scenario. Its not perfect, but would give the user a simple option
Provide an option to choose to the batch mode, and include single-texture batching, which for a game with 100s or thousands of unique (or dynamic) sprites, might be a better fit
Do both

[!NOTE]

is easy, and I think we should do it anyway, as that gives the user a way to address cache thrashing. Cache invalidations (or perhaps uniform set creation count) would be worth adding to the monitors / visual profiler too.

Oct 17 '24 20:10 stuartcarnie

Whilst investigating, I've identified an issue that if the texture is changed:

https://github.com/godotengine/godot/blob/e2c6daf7eff6e0b7e2e8d967e95a9ad56e948231/servers/rendering/renderer_rd/storage_rd/texture_storage.cpp#L581-L597

TextureStorageRD invalidate caches; however, it won't invalidate the cache in the 2D renderer.

Another issue is that if a texture is freed, we would need to invalidate the 2D renderer caches. However, it is unclear if this is taken care of by the uniform_set_invalidation_callback, when a texture is freed? We subscribe to this for invalidating the cache.

EDIT: Yes, it does appear that if a texture is freed, the associated uniform set will be freed, as it is a direct dependency. I do see invalidation callbacks being called when a texture is freed. Texture updates are still an issue.

We would need to add a notification / callback so the 2D renderer can subscribe and invalidate any uniform sets when the texture is changed.

We may still have to limit the size of the cache, as draw order changes might still generate an unbounded number of uniform sets if the set of bound sprites for a given batch does not exist. Normally, the batching attempts to put the same texture in the same array slot, to reuse the same uniform set.

Oct 17 '24 20:10 stuartcarnie

I can confirm the cache is invalidated when textures are freed, so we just need to address the changed issue.

Oct 17 '24 20:10 stuartcarnie

Another pathological case is that if the ordering of nodes changes regularly, and the number of textures is high, it will result in uniform set churn. Not sure if this is a realistic scenario, as any solution we introduce is likely to perform poorly is some cases.

I added a timer to @Calinou's unique test case, and a 2D node named Sprites to add the 10,000 sprites:

func _on_timer_timeout() -> void:
	var node = sprites.get_child(0)
	sprites.remove_child(node)
	node.queue_free()
	
	sprites.get_children().shuffle()
	
	var sprite := $Sprite2D.duplicate()
	var tex := GradientTexture2D.new()
	tex.width = 1
	tex.height = 1
	sprite.texture = tex
	sprite.position.x = randf_range(0, 1152)
	sprite.position.y = randf_range(0, 648)
	sprites.add_child(sprite)

Oct 17 '24 21:10 stuartcarnie

2D: Combine texture state to batch more subsequent commands together

Bugfixes

Background

NOTE

This PR: ~54 fps

v4.4.dev3.official.f4af8201b: 15 fps

v4.3.stable.official.77dcf97d8: 55 fps