Ryujinx icon indicating copy to clipboard operation
Ryujinx copied to clipboard

GPU: Eliminate CB0 accesses when storage buffer accesses are resolved

Open riperiperi opened this issue 3 years ago • 4 comments

The zeroth constant is typically used by official drivers as a kind of "driver support buffer". Command buffers/macros load draw parameters in this buffer for the shader to use with the Constant Buffer Update methods, such as base vertex, base instance, and shader storage buffer addresses. The last one is used to "emulate" storage buffers, as it simply adds the address of the storage buffer to the offset provided.

The problem here is that when storage buffer bindings are in use, the macro attempts to update this for every draw. Most of the time it is the same value, sometimes it is different. But in the majority of cases, we shouldn't care, we already located where the global access wanted to go, so we don't even need to read it.

This PR identifies these cases and removes the base address from the offset calculation, so that it doesn't need to be read from cb0, allowing it to be removed from shader bindings in most cases. The initial version of this was written by @gdkchan, I extended it to run in more cases and added the specialization.

This has similar goals to redundant-cbu, which attempts to reduce the number of inline buffer updates, which reduces how much data is copied onto the gpu, reducing render pass splits and barriers placed around buffer updates. The main difference here is the approach. redundant-cbu attempts to reduce this by not syncing the data when macros make redundant changes, whereas this approach attempts to remove constant buffer 0 accesses entirely.

The benefit is that cb0 buffer bindings are entirely removed - the bindings don't need to update, the buffer doesn't have to be looked up, the buffer doesn't have to be checked for modifiation, and most importantly there do not need to be any updates. It is optimized out entirely. (though the macro still updates the memory)

The downside here is that any other access to cb0 that isn't eliminated will mean it appears as a binding. For example, if the base vertex is read from shader, and it was not removed... only dei-hle removes this right now. Thankfully, most games don't use these in the shader.

Alignment Failsafe

If - god forbid - a game were to break this alignment constraint, the shaders will be recompiled with all storage accesses using the offset calculation as before, and they will read cb0. This is a very unexpected case, so the additional shader variant cost is not really a concern, it's just for compatibility.

If the host doesn't have a 16 byte or lower storage buffer granularity, this is just disabled all the time, as it will just end up generating two versions of every shader.

Cases this doesn't help

In certain cases, the optimizer cannot eliminate an unused access. This seems to happen when the address is set in a conditional block, and the same variable is used later for something else. Maybe there's a way to generally improve this, but it seems rare.

The guest OpenGL driver seems to put a lot more in the cb0. redundant-cbu or mirrors are still required to remove render pass splits there, I'm not sure what these ones are so I'm not sure if they can be removed. If there's any access to cb0 apart from the ones we remove, then we still pay the price as if nothing was removed at all.

Draw/Instance IDs can be written by regular draw methods, not just the indirect ones. If these variables are used by a draw macro that we haven't HLE'd, then the constant buffer access for those will not be removed, and we'll pay the price again. HLE for more draw macros is probably the way forward, as it's the only way to verify that these cb0 offsets are being used in a way that can be optimized out.

Inline updates to other constant buffers, storage buffers or vertex/index buffers also are unaffected, and some games do it. For some of these redundant-cbu has no effect, so mirroring the accesses is the only way forward here.

HLE Macros cross compatibility

Since this overlaps a little with HLE Macros removing cb0 accesses for vertex/instance IDs, the space that PR uses for the has been reserved here.

Affected games

Generally the same as #3828, but can be a stronger effect, such as in Xenoblade Chronicles: Definitive Edition. This removes a bunch of uploads redundant-cbu can't, but also doesn't remove all the same ones. Having both removes the most cases.

Xenoblade Chronicles: Definitive Edition and Link's Awakening have the largest impact here. Note that the first still has unrelated performance issues when drawing a lot of characters.

Killing these updates has been a journey. It took like 4 branches, but we're eventually getting there.

Testing

Hopefully does not break anything - this does change every storage access.

riperiperi avatar Nov 14 '22 00:11 riperiperi

Previous tests: https://github.com/Ryujinx/Ryujinx/pull/3828#issuecomment-1310071773

Compared to redundant-cb2, provides further improvement to Xenoblade Chronicles: Definitive Edition on Vulkan (50 -> 65 -> 70) image

OpenGL also runs faster (75 -> around 80) image

These changes can stack, but there won't be much additional benefit in this game. Generally stacking them would benefit games where one applies and the other does not.

riperiperi avatar Nov 14 '22 00:11 riperiperi

Significant performance improvement in XB 3 and 2 with this PR

23>30

Mainline: image

PR: image

29>34

Mainline: image

PR: image

Bjorn29512 avatar Nov 14 '22 11:11 Bjorn29512

Feedback addressed. CommitBindings now refreshes the shader when any buffer becomes unaligned, or the reverse happens (in addition to the existing texture check). I also added a check to the compute dispatch which refetches the shader if this happens. Constant buffers are now bound after storage buffers on compute, as the bindings may change if buffers become unaligned (cb0 could reappear).

I think maybe in future, we should remove the burden of checking individual spec state (either after buffer/texture bindings, or the big one after state update) to fields in a "current spec state" being updated by each method directly with a dirty flag. This would avoid the specialization state check that happens every draw, and has notable impact on performance, as well as making it easier to check these bindings using a single flag rather than very specific checks.

riperiperi avatar Nov 16 '22 17:11 riperiperi