godot icon indicating copy to clipboard operation
godot copied to clipboard

Ubershaders and pipeline pre-compilation (and dedicated transfer queues).

Open DarioSamo opened this issue 1 year ago • 35 comments

This is a big PR with quite a bit of history that should be evaluated very thoroughly to evaluate where do we want to make some concessions and try to mitigate the side effects as much as possible. However, the benefits are essential to shipping games with the engine and making the final experience for users much better. To read further on @reduz's notes about the topic, you can check out these documents (Part 1 and Part 2).

Due to the complexity of this PR and how 4.3 is currently in feature freeze, I'd definitely not consider this PR until 4.3 is out. If you want the TL;DR: skip ahead to the two videos with the TPS demo to see the immediate difference.

NOTE: These improvements will only affect the Forward+ and Mobile renderers. No changes are expected for Compatibility.

Transfer queues

First of all, this PR supersedes the transfer queues PR and effectively uses it as its base. The reliance on needing to unlock parts of the behavior of RenderingDevice to make it multithread-friendly to reap the benefits was far too much to keep both PRs separate. As mentioned in that previous PR, merging it as is will cause a small performance regression unless https://github.com/godotengine/godot/pull/86333 is merged first.

Pipeline compilation

Modern APIs like Vulkan and D3D12 have made rendering pipeline management very explicit: their creation is no longer hidden behind the current rendering state and handled on demand by the driver. Instead, the developer must create the entire pipeline ahead of time and wait on a blocking operation that can take a significant amount of time depending on the complexity of the shader and the speed of the hardware. This has seen some improvements recently with the introduction of new extensions like VK_EXT_graphics_pipeline_library, but as always, Godot must engineer solutions aimed towards resolving the problem for as much hardware as possible and use such features optionally for optimization in the future.

Godot has the responsibility to perform as fast as possible for the end user, which leaves it no choice but to generate pipelines with the least amount of code and requirements as possible. The engine achieves this through the use of shader compilation macros (shader variants) and the use of specialization constants to optimize code for a particular pipeline (pipeline variants). While Godot resolves shader variant compilation and can even ship the shader cache to skip the step altogether, it coudn't resolve pipeline variant compilation ahead of time before this PR at all.

If you're familiar with the "stutters when playing the game for the first time" phenomenon that has plagued all games shipped with Godot 4's RD-based renderers, this is pretty much the entire root of the problem. This is not a problem exclusive to Godot as it's been very evident in lots of commercial releases that include very extensive shader pre-compilation steps the first time a game starts or a driver update happens. The issue is so prevalent even Digital Foundry points it out as the #1 problem plaguing PC game releases in this article and they never fail to mention the existence of the problem on any new game that suffers from it.

Ubershaders for Godot 4

The exciting part about this PR is an effective solution was developed to address this problem completely without the need to introduce extensive shader pre-compilation steps or any input from the game developer whatsoever. Instead, attempts have been made to make pipeline compilation a part of loading assets as much as possible. Not only does this mean most pipeline compilation is no longer resolved at drawing time, it can also even be done in background threads and presented as part of a regular loading screen. That means the game is no longer at the mercy of the renderer introducing these stutters when it needs to draw, but it makes the behavior much more predictable and able to be handled as part of a loading process.

The main improvement this PR makes is the introduction of ubershaders once more to the engine, but these are quite different from what was previously done in Godot 3. Unlike the previous version of the engine, these shaders do not correspond to generating text shaders with specializations and compiling them in the background, which could lead to a lot of CPU usage that'd take lots of time in weaker systems. Instead, ubershaders are mostly still very similar to the current shaders the engine already has, with a key difference: specialization constants are pulled from push constants instead. This means that the engine is able to use a version of the shader already that can be used for drawing immediately while the specialized version is generated in the background. Pipeline variants are much faster to generate like this instead of relying on runtime shader compilation to insert the constants as part of the shader text, as they work directly on the SPIR-V and skip the need to compile the shader from text again.

Specialization constants are a big part of how Godot optimizes pipelines, but they've been limited by parts of the design as to how many can actually be used. Any additional constant implied an explosion of variants that led to the pipeline cache structure getting even bigger (160 KB in just pointers in Forward+ for any single material in master at the moment!), and every new addition meant that if the state is very dynamic, stutters would occur due to extra pipeline compilation. This was quite evident in the Mobile renderer, which uses a specialization constant to disable lights if they're not used: as soon as a light popped up, then stutters due to pipeline compilation were inevitable.

With this change, a new simple hashing system for pipeline caching is introduced instead:

  • The required pipeline is requested in its specialized form from the cache.
  • If the pipeline is not available yet, compilation is started on the background without stalling the main thread.
  • The ubershader pipeline (which has been compiled at loading time) is used instead. The specialization constants are pushed as part of the push constant instead along with some other rasterization state parameters (like backface culling).

Pipeline compilation at loading time

The other key part behind the PR is the introduction of pipeline compilation of the ubershaders in two extra steps.

  • During the creation of the surface cache in the renderer (stutter on scene setup).
  • During loading of ArrayMeshes (can be pushed to a background thread).

The difference in how both of these changes work together is pretty evident on the TPS demo by simulating a clean run as an end user would see the first time they run the game. A big chunk of the stutters are gone, especially the one that happens the first time the character shoots, which is a typical case of a stutter that only happened at drawing time despite the effect being loaded in the scene tree already.

Both of these videos have pipeline caching disabled and the driver cache deleted between each run.

master (dc91479082bd9cd56cd74282573492fe1ba9618a)

https://github.com/godotengine/godot/assets/538504/4152255e-ee0c-409c-9fa9-06cde148a62a

transfer_and_pipelines

https://github.com/godotengine/godot/assets/538504/391eaa02-00bf-4986-8972-f1d44d80ab1b

It's also worth noting how the loading screen animation actually plays out more of the time instead of having one big stutter at the end due to the initial pipeline compilation at drawing time. These loading times are also significantly shortened by making multiple improvements to the behavior of both the shader and pipeline compiler, allowing it to multi-thread more effectively and use more of the system's resources.

The negatives (and how we can mitigate them)

As was expected, these benefits do not come for free. But there's multiple ways we can attempt to mitigate most of the extra cost and this is an area I'm open to feedback on and that we can further optimize in future versions as well.

  • An extra shader variant had to be introduced. @reduz has diligently always recommended against adding shader variants as it leads to a combinatory explosion, but this is one sacrifice that had to be made to allow ubershaders to exist. However, there's potential for reducing some of the existing shader variants to dynamic paths in the ubershader if possible. This is also significantly mitigated by improvements to the shader compiler's multithreading and it should be a non-issue for games that ship the SPIR-V shader cache.
  • Loading times are bound to be longer as pipeline compilation is pushed here instead. This is the intended effect and it's paying a cost upfront that would happen at drawing time otherwise (which is less preferable). That said, pipeline caching always plays a part here and it'll speed up loading times in later runs as it should.
  • Higher memory consumption from extra pipelines and shader variants being compiled that might go unused. This is sadly one cost that must be paid no matter what and can hopefully be mitigated by implementing better detection of features in use.

The biggest reason behind these negatives is the engine's flexibility. Features can be turned on and off without explicit operations from the user at a global level: a scene can be instanced to use VoxelGI while another one might use Lightmaps instead. As a matter of fact, this is exactly what the TPS demo does, so any run of the game must pre-compile the Lightmap variants because it can't know ahead of time which method the user has chosen without looking at the scene's contents, which is yet to be instanced during mesh loading.

One of the things I hope to improve while this PR is in progress is reducing the amount of variants that are pre-compiled as much as possible. Therefore it'd be great to gather feedback on which of these methods are most effective and how to implement them:

  • Detecting features that aren't used at the project level would help significantly. If a user never uses VoxelGI, we shouldn't be pre-compiling variants for it. The biggest culprits here that I could identify are features like separate specular and motion vectors. Adding some form of tracking somewhere at a global level so the engine can know ahead of time without having to instance scenes would be very helpful here.
  • Assuming features aren't enabled by default and going back to compile them if they are: this is actually something that's partially implemented with the 'advanced' shader groups already. Upgrading this to a per-feature detection could help a lot towards reducing pre-compilation and delegating it to the surface cache setup. If a developer wishes to properly delegate the pre-compilation during mesh loading, all they need to do is just instance scenes first with the appropriate features.
  • Allowing developers to opt in or out of variants to pre-compile at a global level instead. This is likely a very good solution for more experienced developers to fine-tune their game if it actually has a significant amount of shaders and materials that need it. This would likely be a simple set of toggles indicating which features shouldn't be compiled as the developer knows they'll never make use of them.

It's worth noting that under the current implementation, none of these leading to false positives will lead to the engine misbehaving: at worst, it just causes the drawing time stutters the current version already has.

Testing methodology

  • As of https://github.com/godotengine/godot/pull/90271 (which is currently merged), you can now disable the pipeline cache at the engine level (if the API supports it). It is essential to turn this setting off to do any comparisons and to simulate the experience of a clean run of a game.

image

  • Delete the driver pipeline cache. This is the last barrier of defense the driver has if the application doesn't implement a pipeline cache of its own. This heavily depends on the IHV and the platform (e.g. on Windows & NVIDIA it's located at %LocalAppData%/NVIDIA/GLCache). No test should be considered valid without deleting this cache first and foremost.

image

  • Run the game!

Trying to measure the results can be a bit tricky as the results are heavily dependent on the behavior you see in a project. As the benefits are more visually evident as seen in the videos, it is hard to measure the effects of pipeline compilation at drawing time as they present themselves as stutters that happen all throughout the game instead of one particular scenario.

New performance monitors

Some new statistics have been added to the performance monitors which should help verify without a shadow of a doubt if the pipeline pre-compilation is working as intended. There's four different pipeline compilation sources that are identified and they should help towards understanding where a extended loading time or stutter comes from.

image

Quoted from the documentation added by this PR:

  • RENDERING_INFO_PIPELINE_COMPILATIONS_MESH: Number of pipeline compilations that were triggered by loading meshes. These compilations will show up as longer loading times the first time a user runs the game and the pipeline is required.
  • RENDERING_INFO_PIPELINE_COMPILATIONS_SURFACE: Number of pipeline compilations that were triggered by building the surface cache before rendering the scene. These compilations will show up as a stutter when loading scenes the first time a user runs the game and the pipeline is required.
  • RENDERING_INFO_PIPELINE_COMPILATIONS_DRAW: Number of pipeline compilations that were triggered while drawing the scene. These compilations will show up as stutters during gameplay the first time a user runs the game and the pipeline is required.
  • RENDERING_INFO_PIPELINE_COMPILATIONS_SPECIALIZATION: Number of pipeline compilations that were triggered to optimize the current scene. These compilations are done in the background and should not cause any stutters whatsoever.

bugsquad edit: Fixes https://github.com/godotengine/godot/issues/61233

TODO

  • [x] Pass CI and address compatibility breakage (if there's any).
  • [x] Make sure compatibility renderer hasn't had any regressions from modifying the common classes.
  • [x] Address any multi-threading issues that can possibly arise from both transfer queues and this PR.
  • [x] Test D3D12 for any regressions or new issues introduced from multithreading.
  • [x] Evaluate further ways to reduce the amount of pipelines being pre-compiled.
  • [x] Evaluate any possible CPU-time regressions during drawing and how to mitigate them.
  • [x] Evaluate adding this improvement to Canvas Renderer as well.

Contributed by W4 Games. 🍀

DarioSamo avatar Apr 08 '24 17:04 DarioSamo

  • Allowing developers to opt in or out of variants to pre-compile at a global level instead. This is likely a very good solution for more experienced developers to fine-tune their game if it actually has a significant amount of shaders and materials that need it. This would likely be a simple set of toggles indicating which features shouldn't be compiled as the developer knows they'll never make use of them.

This resembles https://github.com/godotengine/godot-proposals/issues/5229 and https://github.com/godotengine/godot-proposals/issues/6497 a lot, although I haven't proposed it for VoxelGI and LightmapGI yet as these are not Environment or CameraEffects properties.

If such a setting is disabled, we can assume the user is OK with having runtime shader compilation occur the first time the setting is enabled (since they'll probably be in an options menu while doing so).

Calinou avatar Apr 08 '24 21:04 Calinou

Assuming features aren't enabled by default and going back to compile them if they are: this is actually something that's partially implemented with the 'advanced' shader groups already. Upgrading this to a per-feature detection could help a lot towards reducing pre-compilation and delegating it to the surface cache setup. If a developer wishes to properly delegate the pre-compilation during mesh loading, all they need to do is just instance scenes first with the appropriate features.

I gave this a shot and got pretty successful results. The current caveat is that pipeline compilation will be less likely to be triggered for resources loaded through a background thread in a loading screen unless the game features an scene first with the feature used in-place. If not, then it must defer the loading to the surface cache creation instead.

However the results are pretty good. The pre-compilation on the TPS demo has gone down significantly:

image

That's around 300 pipelines down from 650+ pipelines in the OP, pretty much doubling the speed of the initial load in the demo that I showcased in the video and still has no pipeline stutters during drawing. I haven't detected any regressions from implementing this yet but trying to find edge cases is still worth investigating.

https://github.com/godotengine/godot/assets/538504/9a2da5c7-a8d9-4d45-af49-fb316cd43c00

I still think we could use some global settings to fine-tune the behavior (e.g. automatically detect, always pre-compile, never pre-compile), but this gets us much closer to an ideal level of pre-compilations that I wanted to see from the start.

DarioSamo avatar Apr 11 '24 17:04 DarioSamo

I investigated Canvas Renderer support and the potential problems we'd have to fix to fully take advantage of it.

First off, Canvas Renderer does suffer from the exact same problem: pipelines are compiled at drawing time if necessary. However, the total amount of pipelines that this does happen on is fairly small. However, it's undeniable you can get stutters from behavior such as enabling and disabling lights in proximity of the elements.

I added the entire framework for supporting ubershaders but ultimately left it disabled for now for a few reasons even if it does work as intended.

  • The amount of pipelines that you can pre-compile from just the shader data is about six without taking into account the possibility of meshes and polygons. That's a lot of pipelines to pre-compile when the final amount usually ends up being far less. In one example project, the pre-compiled count was basically 24 while the specializations were merely 3. That's a lot of added loading time for very little benefit.
  • It's impossible for the checks I added to pre-compile pipelines ahead of time for polygons and meshes. The vertex attribute format needs to be known ahead of time with exact precision of the offsets and strides.
  • A real solution would involve some sort of scheme where we detect commands that get added and cached, pass those off to the renderer and pre-compile pipelines for the cached commands. However upon experimentation it seems whatever point I found that could be used as the hook was called way too often to be beneficial.

For now I'm leaning towards addressing other issues the PR currently has (such as an extra CPU cost due to a mutex I want to avoid), but if anyone has an example of a project that requires lots of different shaders, pipelines, is entirely 2D and suffers from stutters, that'd help to provide a good example of something I can use as a reference.

DarioSamo avatar Apr 22 '24 17:04 DarioSamo

PS: I wonder how this will interact with https://github.com/godotengine/godot/pull/88199 – does Metal make this approach possible?

Calinou avatar Apr 30 '24 00:04 Calinou

PS: I wonder how this will interact with #88199 – does Metal make this approach possible?

Should be completely fine as far as I know as the PR's approach is completely driver-agnostic. A lot of the changes on this one are just basically fixing a lot of stuff that wasn't thread safe, so it could expose some other bugs if part of the Metal driver assumed that wasn't gonna happen (which was a common issue in the D3D12 one but easily fixed).

DarioSamo avatar Apr 30 '24 00:04 DarioSamo

Hey, great work on this! It might need a bit more testing - during my first run of the TPS demo with this branch, I got a freeze for about a second when I fired a shot (but not when it landed). This doesn't seem to be reflected in the Compilations Draw stat, so I'm not sure whether this is actually a shader compilation stutter that just wasn't accounted for, or some issue with threading or loading.

The weird thing is it seems to happen based on when I fire the shot. If I try to fire a shot within a second of loading into the map, I get a momentary freeze. If I wait a couple seconds and then fire, it works fine. Tested a bunch of times. Timing seems to be related to when the flying ship enters view. This one:

2024-07-03 22_40_41-Third Person Shooter (TPS) Demo (DEBUG)

Video:

https://github.com/godotengine/godot/assets/34800072/948f4c80-3915-4014-95ac-65ff1baa3c4e

Windows 11 nVidia GeForce RTX 3070 nVidia driver version 556.12

KeyboardDanni avatar Jul 04 '24 03:07 KeyboardDanni

This doesn't seem to be reflected in the Compilations Draw stat

Did you check if any of the other numbers go up when you do that? Objects that don't get added to the surface cache for some reason before they're shown would still show stutters regardless of what this PR does. Surface cache compilations have their own counter shown in the monitor. Perhaps there was some behavior change that makes the bullet not get added to this surface cache before it shows up.

DarioSamo avatar Jul 04 '24 16:07 DarioSamo

@KeyboardDanni Can you also show what graphical settings you used? Are you using the project unmodified? Did you change the graphical settings in the settings menu inside the game? That can also affect whether the PR encounters all cases ahead of time.

DarioSamo avatar Jul 04 '24 17:07 DarioSamo

This doesn't seem to be reflected in the Compilations Draw stat

Did you check if any of the other numbers go up when you do that? Objects that don't get added to the surface cache for some reason before they're shown would still show stutters regardless of what this PR does. Surface cache compilations have their own counter shown in the monitor. Perhaps there was some behavior change that makes the bullet not get added to this surface cache before it shows up.

I can try, but the graphs only seem to be one second in resolution, which isn't enough to observe a difference when I have to fire the shot within a second or two of level start, and around that time there's more compilations going on so I don't know if any number changes might be related. But I'm open to suggestions and ideas on how to test this better. Perhaps breaking with a C++ debugger would help?

@KeyboardDanni Can you also show what graphical settings you used? Are you using the project unmodified? Did you change the graphical settings in the settings menu inside the game? That can also affect whether the PR encounters all cases ahead of time.

It's version 4.2-f0587b2 from the Asset Library. No modifications except project migration, disabling the Pipeline Cache and Shader Cache, and opening in windowed mode. Here are the graphics settings I used:

2024-07-03 22_28_02-Third Person Shooter (TPS) Demo (DEBUG)

KeyboardDanni avatar Jul 04 '24 17:07 KeyboardDanni

So I tested this again. Firing a shot for the first time seems to raise Compilations Surface by 2 regardless of whether the stutter occurs. Also, the length of the stutter is shorter if I wait longer to do it. So if I wait a second and a half instead of less than one second, the stutter only lasts for about half a second, almost as if it's waiting on some background processing to complete.

Video of me firing slightly later:

https://github.com/godotengine/godot/assets/34800072/568d5633-16c3-4f52-a6ed-a51f336ece34

Edit: Looks like it's caused by VoxelGI being enabled. If I use SDFGI I get no stutter after scene load. If I use LightmapGI the first frame after loading the scene takes a few seconds but otherwise I don't get any stutter. With VoxelGI, it seems like the first lighting effect forces a stutter (but only if done too early).

KeyboardDanni avatar Jul 04 '24 18:07 KeyboardDanni

Edit: Looks like it's caused by VoxelGI being enabled. If I use SDFGI I get no stutter after scene load. If I use LightmapGI the first frame after loading the scene takes a few seconds but otherwise I don't get any stutter. With VoxelGI, it seems like the first lighting effect forces a stutter (but only if done too early).

Thanks, when I get around to rebasing this for 4.4 I can investigate if VoxelGI is causing some pipelines to be missed.

DarioSamo avatar Jul 04 '24 21:07 DarioSamo

Reminder: If https://github.com/godotengine/godot/pull/94289 gets merged, a special exception will need to be made for multimesh handling.

DarioSamo avatar Jul 13 '24 02:07 DarioSamo

Godot v4.3.dev (c21f06204) - Debian GNU/Linux trixie/sid trixie - X11 - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 3080 Ti (nvidia) - AMD Ryzen 7 5800X 8-Core Processor (16 Threads)

driver: 535.183.06

I (finally) tried running this branch because my PR #91672 seems to be blocked in favor of these changes, but it falls over on that PR's test project. Although that project tests new functionality in the Mobile renderer, it should work fine in Vulkan.

There are three major classes of issues I ran into -- note I'm on Linux/X11, but using an NVIDIA card with the most recent driver Debian unstable supports thus far.

  1. Some shader combinations do not compile.

Under this PR, the test project from my PR fails to build shaders in Vulkan: stdout.txt.

(While the lightmap.lmbake file has a version mismatch in this PR's engine, deleting and regenerating it doesn't change the compilation situation, as expected.)

  1. Frequent livelocks under Linux (RenderingDeviceDriverVulkan::fence_wait, stuck in driver).

After encountering the above, I tried reconstructing a simpler project from scratch, by tweaking similar attributes in the material: emptyish.zip , which rendered in-editor fine until I exited (see below).

After that, trying to reload the project in the editor repeatedly hits a livelock in the libnvidia-glcore.

No shader errors were reported to the console, just a skipped ICD message and Vulkan banner as seen in normal usage on my system. Here's the stack trace (in thread 1 at the bottom of the file in all these thread dumps -- on review, it seems most of the other threads are in the same state almost all the time): bt_hang.txt.

Finally, I tried the inimitable TPS demo. I didn't see any shader compilation issues on loading, but the in-editor project load process hit similar hangs, also in the nvidia driver, waiting for a fence. Attempting to finish from the stack trace stalls indefinitely inside the driver. I had to kill the editor and reload the project three times, which finally succeeded (all resources finally imported). bt-tps.txt bt-tps2.txt

But then running the project led to a similar hang on the title screen once my mouse hovered over the Settings button. Same hang location, though occasionally different parent callstacks. bt-run-tps.txt

But, there is good news. The next time I ran the project, it worked flawlessly, with similar performance wrt. expecting stuttering/loading times as shown in the videos above. I was able to play for at least a minute without issue before exiting cleanly.

  1. Crashes on editor exit.

In a few of the times when I could run the editor and exit without killing it, I get a segfault with similar stack trace. (Maybe it is unrelated entirely, and maybe fixed since the Apr 22 baseline, but just FYI): stdout2.txt stdout3.txt


While testing these situations, I deleted the .godot folders between each run, by necessity due to differences in the baselines, and later also tried deleting the Nvidia GL caches (seems to have evolved from ~/.nv/GLCache -> ~/.cache/nv/GLCache -> ~/.cache/nvidia/GLCache on my system0 and turned off the Pipeline Cache setting just in case there is some relation, but it doesn't affect these results. I didn't delete .godot or clear shaders after finding the repeated source of the livelock (except when swapping engines) since that only exacerbated the hangs.

Needless to say, all the projects load fine without shader issues or hangs in 4.3 stable. TPS Demo from clean project, run, and editor exit is summed up in: bt-tps-stable.txt

I hope these turn out to be relatively simple issues to resolve in this branch. This is very exciting work!

eswartz avatar Aug 25 '24 17:08 eswartz

Pushed a big rebase, will re-run tests and verifications soon before marking it as ready for review.

DarioSamo avatar Sep 19 '24 19:09 DarioSamo

Edit: Looks like it's caused by VoxelGI being enabled. If I use SDFGI I get no stutter after scene load. If I use LightmapGI the first frame after loading the scene takes a few seconds but otherwise I don't get any stutter. With VoxelGI, it seems like the first lighting effect forces a stutter (but only if done too early).

I took a look at this and I'm unable to find any reason why SDFGI and VoxelGI would behave different. In fact, SDFGI has even more pipeline compilations when the shooting is triggered on surface: VoxelGI goes up by 2 and SDFGI goes up by 4.

However, Clay and I found the following information about it.

  • Player only instances the bullet when you shoot, which means bullet.tscn is never loaded into the scene ahead of time. Pipelines are compiled when the object is instanced. https://github.com/godotengine/tps-demo/blob/34b6977f3524a0ad513bda0bc06bea27e795c154/player/player.gd#L132
  • Preloading the scene doesn't seem to help, most likely because loading the mesh resource itself isn't enough information to find the pipeline ahead of time. This can happen for a multitude of reasons which we're trying to determine the possible reasons as to why yet. However, in case the stutter is unrelated, you might want to verify if it's possibly related.
  • The delta of pipelines is very small, which hardly sounds like a new surface getting added is triggering a lot of these pipelines.

I'll keep investigating the possible reason, but the problem does go away if an invisible bullet scene is instanced into the tree ahead of time. That said, I can't reproduce the stutter myself, but I do see the pipelines increasing at least.

DarioSamo avatar Sep 20 '24 17:09 DarioSamo

The delta of pipelines is very small, which hardly sounds like a new surface getting added is triggering a lot of these pipelines.

Indeed the problem is just the fact that an effect is not enabled by the time it preloads the bullet scene, so it doesn't create the pipelines necessary for either SDFGI or VoxelGI.

This is the problem I brought up that could happen before.

The current caveat is that pipeline compilation will be less likely to be triggered for resources loaded through a background thread in a loading screen unless the game features an scene first with the feature used in-place. If not, then it must defer the loading to the surface cache creation instead.

DarioSamo avatar Sep 20 '24 17:09 DarioSamo

I get what's going on now.

TPS Demo has this on the title screen.

image

This is an attempt by the game to cache the pipelines ahead of time for the bullet. However, the title screen does not use SDFGI or VoxelGI, so the ubershader pipeline pre-compilation is not detecting it should create these pipelines.

The extra two to four pipelines we're seeing on the first shot are the new pipelines encountered due to the fact the effect only gets enabled when starting the game, but the instancing of this bullet ahead of time is not done there instead.

Merely instancing the bullet scene as hidden would fix this issue immediately, as hidden surfaces are added to the cache.

I can't really think of many mechanisms to fix it and I'd argue it's a bit of a content issue, as the caching mechanism in the demo was built specifically around the lack of this functionality.

@KeyboardDanni I think it's pretty justified that the demo itself could be edited to fix this, probably through instancing the hidden version of the bullet somewhere (maybe on the player itself).

DarioSamo avatar Sep 20 '24 19:09 DarioSamo

I've tested locally with the TPS demo and the Legend of the Nuku Warriors demo

TPS Demo

I can reproduce the issue that Calinou flagged above with the core particles turning opaque. I can reproduce the issue in the editor by going to the level scene and flying the camera into the core. If I go to the core scene the issue goes away. Same if I make the children editable in the level scene, then hide the particles and show them again.

I cannot reproduce the crash

All cases of stuttering are gone for me in this demo (with the bullet loading fix from Dario)

Nuku

When running the Nuku warriors demo I get a deadlock during scene loading. I tested 2be730a05b7ff221b89c967981f7caee6e164ef0 (the commit before this one) and I get the same deadlock. So that is an existing problem in master

edit: by forcing single threaded loading I can run the demo. I still get a big stutter when I first turn around, but all the other stutters are gone. On master (and with this PR) I am getting a deadlock during import (and due to a different bug, it force reimports every time I open the editor) so I can't run it from the editor and check to see if a shader compilation is happening at run time

clayjohn avatar Sep 25 '24 02:09 clayjohn

I've seen the visual issue in the TPS demo so I'll try to reproduce and investigate.

When I tested Nuku back in April the stutters were gone, but I'll double check if some other change happened in the middle that could cause pipelines to be missed.

DarioSamo avatar Sep 25 '24 11:09 DarioSamo

I got a crash once when opening the TPS demo in the editor. I could only reproduce this once after removing ~/.cache/nvidia/. I couldn't reproduce this again after trying to follow the same steps.

@Calinou Your crash in particular seems to be on the shader compiler, so taking out your SPIR-V cache seems like the most likely way to reproduce it. That said, it seems to be quite a few levels deep down. I've not reviewed the possibility of glslang having some shared state between threads yet that could lead to corruption but it shouldn't be the case.

DarioSamo avatar Sep 25 '24 11:09 DarioSamo

The transparency issue should be fixed. There was a local variable to the material that was only being updated during pipeline creation, introducing a possible race condition if the draw list happened to insert it first. This has been moved to only be updated on set_code as it was before. No local variable changes other than the pipeline hash map are allowed during pipeline creation.

DarioSamo avatar Sep 25 '24 13:09 DarioSamo

I still get a big stutter when I first turn around, but all the other stutters are gone.

So far this one seems to be just the fact that the usage of light cubemaps is not triggered until the camera is turned around. A bit of a content issue but perhaps we can mitigate these by being able to know ahead of time from some other means whether light cubemaps or atlases are used in some other way. Perhaps light storage could keep track of this information somehow or notify the renderer when a new shadow mode is used? Or just keep track of the generated lights altogether and we use the dependencies to know?

DarioSamo avatar Sep 25 '24 13:09 DarioSamo

I've improved the light detection for variants to be ahead of time by making light storage toggle these variants when the shadow types are set. This gets rid of stutters when the light variant is not in view yet on the scene.

DarioSamo avatar Sep 25 '24 14:09 DarioSamo

================================================================
handle_crash: Program crashed with signal 11
Engine version: Godot Engine v4.4.dev.custom_build (fdac73548424c59899cb8cde26c551570107ebca)
Dumping the backtrace. Please include this when reporting the bug to the project developer.
[1] /lib64/libc.so.6(+0x40d00) [0x7facce5c5d00] (??:0)
[2] bin/godot.linuxbsd.editor.x86_64() [0xa82fb2] (/usr/include/c++/14/bits/char_traits.h:427)
[3] bin/godot.linuxbsd.editor.x86_64() [0xabe70a] (/home/hugo/Documents/Git/godotengine/godot/thirdparty/glslang/glslang/MachineIndependent/Scan.cpp:933)
[4] bin/godot.linuxbsd.editor.x86_64() [0xac1850] (/home/hugo/Documents/Git/godotengine/godot/thirdparty/glslang/glslang/MachineIndependent/Scan.cpp:296 (discriminator 2))
[5] bin/godot.linuxbsd.editor.x86_64() [0xb0049b] (/home/hugo/Documents/Git/godotengine/godot/thirdparty/glslang/glslang/MachineIndependent/ParseHelper.cpp:199)
[6] bin/godot.linuxbsd.editor.x86_64() [0x99098a] (/home/hugo/Documents/Git/godotengine/godot/thirdparty/glslang/glslang/MachineIndependent/ShaderLang.cpp:1212)
[7] bin/godot.linuxbsd.editor.x86_64() [0x3bde5a4] (/home/hugo/Documents/Git/godotengine/godot/./servers/rendering/rendering_device.cpp:215)
[8] bin/godot.linuxbsd.editor.x86_64() [0x3db4338] (/home/hugo/Documents/Git/godotengine/godot/./core/templates/cowdata.h:461)
[9] bin/godot.linuxbsd.editor.x86_64() [0x48fbea6] (/home/hugo/Documents/Git/godotengine/godot/./core/object/worker_thread_pool.cpp:101)
[10] bin/godot.linuxbsd.editor.x86_64() [0x48fc6d0] (/home/hugo/Documents/Git/godotengine/godot/./core/object/worker_thread_pool.cpp:205)
[11] bin/godot.linuxbsd.editor.x86_64() [0x42c9e3d] (/home/hugo/Documents/Git/godotengine/godot/./core/os/thread.cpp:64)
[12] bin/godot.linuxbsd.editor.x86_64() [0x51558a4] (thread.o:?)
[13] /lib64/libc.so.6(+0x976d7) [0x7facce61c6d7] (??:0)
[14] /lib64/libc.so.6(+0x11b60c) [0x7facce6a060c] (??:0)
-- END OF BACKTRACE --
================================================================
[1]    53286 IOT instruction (core dumped)  bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo -e

I'm not really able to gather much about this one. The crash would imply glslang has some internal error but a brief review about it gives me no shared context whatsoever that could be causing the issue. It'd be no different of a situation from what was there before this PR, it's just that this PR is able to parallelize shader compilation more effectively so it sounds likely an existing error is just more likely to crop up when the engine can take advantage of more threads for shader compilation.

DarioSamo avatar Sep 25 '24 16:09 DarioSamo

I've made https://github.com/godotengine/godot/pull/97465 as a new related PR to avoid a performance regression from this PR.

DarioSamo avatar Sep 25 '24 17:09 DarioSamo

I feel like there should be some user-level control over which ubershader pipelines are compiled ahead of time. I appreciate that there's runtime heuristics but there's going to be corner cases that users will have to work around by pre-instantiating objects. I'd think a good implementation of ubershader pipeline pre-compilation would avoid needing these workarounds.

It'd be great if there were some sort of Resource or ProjectSetting that could be used to fine-tune pre-compilation. For example, if you know that you'll need VoxelGI or certain light types in your scene, you could specify that directly.

Having said that, even without direct control over pre-compilation, ubershaders should at least simplify making a "compiling shaders" screen since you won't have to instantiate all the material variants.

KeyboardDanni avatar Sep 25 '24 20:09 KeyboardDanni

Not sure if you missed my comment, but I'm still seeing the mostly reliable livelock behavior as I mentioned before, with all the same kinds of stack traces. The fade test project I linked uses a lot of different materials in one scene and seems to trigger (threading?) issues quite easily.

Updated stacks: stack.txt (in thread 1) stack2.txt (in thread 10)

Occasionally I can run properly, and when editing or running that project, this PR spits out a repeatable series of errors (while seeing to render properly nonetheless):

Godot Engine v4.4.dev.custom_build.ccb0fb316 (2024-09-25 16:53:21 UTC) - https://godotengine.org
WARNING: GENERAL - Message Id Number: 0 | Message Id Name: Loader Message
	terminator_CreateInstance: Received return code -3 from call to vkCreateInstance in ICD /usr/lib/x86_64-linux-gnu/libvulkan_virtio.so. Skipping this driver.
	Objects - 1
		Object[0] - VK_OBJECT_TYPE_INSTANCE, Handle 94045832866768
     at: _debug_messenger_callback (drivers/vulkan/rendering_context_driver_vulkan.cpp:639)
Vulkan 1.3.242 - Forward+ - Using Device #0: NVIDIA - NVIDIA GeForce RTX 3080 Ti

ERROR: Mismatch fragment shader output mask (0) and framebuffer color output mask (1) when binding both in render pipeline.
   at: render_pipeline_create (servers/rendering/rendering_device.cpp:3452)
ERROR: Mismatch fragment shader output mask (0) and framebuffer color output mask (1) when binding both in render pipeline.
   at: render_pipeline_create (servers/rendering/rendering_device.cpp:3452)
ERROR: Condition "pipeline.is_null()" is true.
   at: _create_pipeline (servers/rendering/renderer_rd/forward_clustered/scene_shader_forward_clustered.cpp:381)
ERROR: Condition "pipeline.is_null()" is true.
   at: _create_pipeline (servers/rendering/renderer_rd/forward_clustered/scene_shader_forward_clustered.cpp:381)
ERROR: This render pipeline requires (0) bytes of push constant data, supplied: (32)
   at: draw_list_set_push_constant (servers/rendering/rendering_device.cpp:4298)
ERROR: No render pipeline was set before attempting to draw.
   at: draw_list_draw (servers/rendering/rendering_device.cpp:4319)
ERROR: Mismatch fragment shader output mask (0) and framebuffer color output mask (1) when binding both in render pipeline.
   at: render_pipeline_create (servers/rendering/rendering_device.cpp:3452)
ERROR: Mismatch fragment shader output mask (0) and framebuffer color output mask (1) when binding both in render pipeline.
   at: render_pipeline_create (servers/rendering/rendering_device.cpp:3452)
ERROR: Condition "pipeline.is_null()" is true.
   at: _create_pipeline (servers/rendering/renderer_rd/forward_clustered/scene_shader_forward_clustered.cpp:381)
ERROR: Condition "pipeline.is_null()" is true.
   at: _create_pipeline (servers/rendering/renderer_rd/forward_clustered/scene_shader_forward_clustered.cpp:381)
ERROR: This render pipeline requires (0) bytes of push constant data, supplied: (32)
   at: draw_list_set_push_constant (servers/rendering/rendering_device.cpp:4298)
ERROR: No render pipeline was set before attempting to draw.
   at: draw_list_draw (servers/rendering/rendering_device.cpp:4319)
ERROR: The vertex format used to create the pipeline does not match the vertex format bound.
   at: draw_list_draw (servers/rendering/rendering_device.cpp:4326)
ERROR: The vertex format used to create the pipeline does not match the vertex format bound.
   at: draw_list_draw (servers/rendering/rendering_device.cpp:4326)
ERROR: This render pipeline requires (16) bytes of push constant data, supplied: (32)
   at: draw_list_set_push_constant (servers/rendering/rendering_device.cpp:4298)
ERROR: The vertex format used to create the pipeline does not match the vertex format bound.
   at: draw_list_draw (servers/rendering/rendering_device.cpp:4326)
ERROR: This render pipeline requires (16) bytes of push constant data, supplied: (32)
   at: draw_list_set_push_constant (servers/rendering/rendering_device.cpp:4298)
ERROR: The vertex format used to create the pipeline does not match the vertex format bound.
   at: draw_list_draw (servers/rendering/rendering_device.cpp:4326)
^C
Debian GNU/Linux trixie/sid trixie - X11 - Vulkan (Mobile) - dedicated NVIDIA GeForce RTX 3080 Ti (nvidia) - AMD Ryzen 7 5800X 8-Core Processor (16 Threads)

eswartz avatar Sep 25 '24 20:09 eswartz

Not sure if you missed my comment, but I'm still seeing the mostly reliable livelock behavior as I mentioned before, with all the same kinds of stack traces. The fade test project I linked uses a lot of different materials in one scene and seems to trigger (threading?) issues quite easily.

I haven't missed it, but some the information in the comment did not seem very related to the PR and it was done on a version several months behind of master. I feel some of those reports need further validation of whether something is related to the PR or not.

I do however suspect the one where pipelines are unable to be created with explicit errors from RenderingDevice is worth a look. I'll take a look at the test project specifically for that.

But I've done pretty extensive testing on the threading logic and most of it is managed by the WorkerThreadPool class of the engine, so I don't really get much use out of the stack traces and have little control over it from this PR's scope if there's issues with its management. It might be an area you'll need to investigate yourself if I can't find much else about it.

DarioSamo avatar Sep 25 '24 21:09 DarioSamo

It'd be great if there were some sort of Resource or ProjectSetting that could be used to fine-tune pre-compilation. For example, if you know that you'll need VoxelGI or certain light types in your scene, you could specify that directly.

There's most likely plans to add something like this to aid packing of pre-compiled shaders into exported projects, so such a setting will most likely be borrowed from for a similar purpose. It'll most likely be done after this PR though.

DarioSamo avatar Sep 25 '24 21:09 DarioSamo

I haven't missed it, but some the information in the comment did not seem very related to the PR and it was done on a version several months behind of master. I feel some of those reports need further validation of whether something is related to the PR or not.

I did run against master and the project works fine in all ways. You've rebased against master as well, so only these changes are likely involved. I wouldn't have reported the issue against this PR otherwise. :)

eswartz avatar Sep 25 '24 22:09 eswartz