godot Add support for enhanced barriers in D3D12.

Enhanced barriers are a relatively modern feature that wasn't part of the initial D3D12 specification. This feature brings barriers in D3D12 much closer in functionality to Vulkan (with a few caveats this PR works around).

The current D3D12 driver was designed around legacy barriers, and this PR maintains that path as it is. Enhanced barriers are only used if the driver reports them as supported. To enable this, you might need to use the Agility SDK and use it as indicated here. If you wanna double check if this is reported, you can check by verifying the return of this line.

By implementing enhanced barriers, that means the D3D12 driver can finally use the information provided by the recently introduced Acyclic Rendering Graph to group barriers together as much as possible and allow more commands to execute in parallel.

So far this has given me a real performance increase in an NVIDIA GeForce RTX 3090 Ti. A 3D project with about every post processing effect enabled went from ~2.57ms in master to ~2.16ms in d3d12_enhanced_barriers, an improvement very much in line with the performance benefits we saw when the render graph was introduced.

The second part of this PR was reworking the fallback that the D3D12 driver had to do into something more generic. When relaxed format casting is not available, the driver has to opt for making copies of textures, transitioning to the proper layouts and replacing them in descriptor sets without the render graph knowing any of that is going on. This fallback was instead re-implemented entirely at the RenderingDevice level, allowing the ARG to reorganize the operations and also get a performance boost. This also allows enhanced barriers to work with this fallback without issue, as it would've been impossible to solve it properly otherwise.

Some small changes had to be made to RenderingDeviceDriver to give more detail about the operations and states the resources are used to comply with D3D12's restrictions. No regressions are expected but it should be tested thoroughly regardless.

May 09 '24 15:05 DarioSamo

* `driver_clear_resources_with_views` and the other bool member above could become bitfields.

Admittedly the purpose of these is just to store one flag and be consistent throughout the execution of the graph without having to constantly call the api_trait_get and go through the cost of using the virtual function. I'm not really very much for or against the approach chosen here (the alternative being just querying the API or using the bitfield), but I went with whatever requires the least amount of knowledge from the reader. The graph is only created once per RenderingDevice so there's not much in the way of memory savings to be found.

* Also for `driver_clear_resources_with_views`, I'd suggest a rename to the more compact `driver_uses_views_for_clears`. Or, alternatively, inverting its meaning and naming it `driver_clears_via_copy_engine`, or something similar. (The rename would carry over to `API_TRAIT_CLEAR_RESOURCES_WITH_VIEWS`.)

This I agree with. It'd be best to change the trait to something that represents more clearly that Vulkan is capable of doing this. It might be more common that we find APIs that don't.

* Finally, there's the `thread_local LocalVector` technique. That's kind of a safer `alloca()`, isn't it? I guess we should discuss rules about when to prefer that over the `ALLOCA_ARRAY()` utility macro that drivers use, which may mean we let the drivers do the same, but in any case come up with something consistent.

Yes, the usage of thread_local is very deliberate across this and my other PRs as it's an effective way to avoid using stack memory and reusing an already allocated vector across calls while also allowing the method to remain safe when using multi-threading. It's important to note thread_local also implies the existence of static, so these vectors do not get re-allocated on each call and merely get reused.

This will be a pretty common pattern among some other optimization PRs that will be coming later. Considering there's no upper bound to some of these calls, relying on stack memory could become a problem at some point that'd leave no other possible alternative as a fix.

May 10 '24 18:05 DarioSamo

[...] but I went with whatever requires the least amount of knowledge from the reader. The graph is only created once per RenderingDevice so there's not much in the way of memory savings to be found.

I believe anyone hoping to understand this area of the engine is aware of bitfields. It's not a great saving, but why missing the opportunity of packing a few similarly-purposed booleans together?

May 13 '24 10:05 RandomShaper

I believe anyone hoping to understand this area of the engine is aware of bitfields. It's not a great saving, but why missing the opportunity of packing a few similarly-purposed booleans together?

It's not something I gave that much thought to be honest as it wouldn't make much of a difference and from my POV it seemed easier to understand. I have no reason to argue for or against it as there's no measurable performance impact. I'll go with whatever preference the project wants.

May 13 '24 12:05 DarioSamo

Yes, the usage of thread_local is very deliberate across this and my other PRs as it's an effective way to avoid using stack memory and reusing an already allocated vector across calls while also allowing the method to remain safe when using multi-threading. It's important to note thread_local also implies the existence of static, so these vectors do not get re-allocated on each call and merely get reused.

The technique is cool. I'd only like to raise the need of discussing its usage at some point in the future. Lately, I've been trying to remove or minimize runtime allocations in some PRs, and I don't quite feel good about we adding them to other area of the engine. I believe you are more experienced than me regarding the real impact of them in this context, so I'm all ears (or eyes in this context). My point is that something as thin as the rendering driver shouldn't be the culprit of a spike in frame time because it happened to need to invoke the allocator.

I have an idea (again not for now, not for this PR): we can use something similar to your technique, that would hit the sweet spot between alloc() and thread_local worst-case arrays. That would be using arena-backed arrays. The worst-case allocation would be a chunk of memory per frame-in-flight, used as an arena from where the arrays would be allocated. LocalVector could be extended to support custom allocators and we'd have an arena allocator class. That would have the safety of a proper array class while also being growable, but would cause much fewer runtime allocations (and a project setting could be used to set the initial size so users can tweak settings so their games don't ever need to allocate at runtime if they really want to do so and check the max. frame arena size the profiler would report).

UPDATE: It doesn't really need to be per-frame, since it's throwaway data, but the idea is mostly the same.

May 13 '24 17:05 RandomShaper

I'm not sure my feedback (the directly actionable part) has been addressed yet.

Also, this may be better rebased on top of #91416, which already includes the upgrade of the Agility SDK to 613, but in a more comprehensive manner.

May 15 '24 10:05 RandomShaper

I'm not sure my feedback (the directly actionable part) has been addressed yet.

Would you say this covers the copy engine suggestion and the bitfields? I feel the arena allocator would be out of scope for this PR until we get a common class we can reuse to adopt that strategy elsewhere.

May 15 '24 11:05 DarioSamo

I believe anyone hoping to understand this area of the engine is aware of bitfields. It's not a great saving, but why missing the opportunity of packing a few similarly-purposed booleans together?

Upon trying the suggestion, one annoying thing I realized is that bitfield declaration does not allow us to give it an initialization value on the class itself and instead requires to do a union to do so. Do we feel it's worth doing this:

union {
  struct {
    bool driver_honors_barriers : 1;
    bool driver_clear_resources_with_views : 1;
  }
  
  bool driver_all = false;
};

Instead of just this?

bool driver_honors_barriers = false;
bool driver_clear_resources_with_views = false;

The other alternative is just doing the initialization withing the class' initializer manually, but that breaks a bit of the style of the rest of the class where you can just read it directly to find out what the defaults are.

May 15 '24 12:05 DarioSamo

I've added the rename to use copy engine. We should decide about the booleans vs the bitfields for the reasons outlined above.

May 15 '24 14:05 DarioSamo

The lack of bitfield initializers is unfortunate indeed, but the style ship has already sailed in that regard. See this: https://github.com/godotengine/godot/blob/f4b047a084a5ecff153d0d32453aeb2e9919c83c/core/object/worker_thread_pool.h#L76-L99

So, until we upgrade to the version of C++ that allows in-place initializers for that, I'd suggest embracing that style as the convention.

To clarify, yes, that's indeed the last piece of feedback I was talking about. Everything else belongs indeed to future iterations, in any case.

May 17 '24 08:05 RandomShaper

I've addressed the bitfield suggestion.

May 20 '24 16:05 DarioSamo

As this is relatively big and I'm planning a 4.3-beta1 release this week, I'm scheduling to merge this after beta1 and for beta2, so we can hopefully find potential regressions before making the beta2 build. Or at least have a "stable" beta1 baseline to compare with if users report D3D12 or RenderGraph regressions in beta2.

May 21 '24 09:05 akien-mga

Thanks!

May 31 '24 12:05 akien-mga

godot godot copied to clipboard

Add support for enhanced barriers in D3D12.

godot
godot copied to clipboard