Ryujinx icon indicating copy to clipboard operation
Ryujinx copied to clipboard

Implement shader storage buffer operations using new Load/Store instructions

Open gdkchan opened this issue 1 year ago • 12 comments

Overview

Depends on #4714.

This is the continuation on the changes to make the shader IR more flexible in general, to allow the implementation of things like transform feedback and geometry shader emulation on mac, without the need of ugly "hacks" to make it fit the current model.

Unlike the other changes though, this one actually changes how the global memory elimination works in a quite significant way. So let's start with the problem. NVIDIA GPUs can access pretty much any GPU mapped memory with the use of LDG and STG instructions (Load Global and Store Global). Storage buffers (which is how higher level APIs provides read and write access to buffers) is implemented by loading the base pointer from a constant buffer, and then adding an offset. Global memory elimination is a pass that the emulator does to find the base address for storage buffer from LDG/STG instructions. It then transforms those accesses on regular storage buffer accesses.

The way how global memory elimination works in right now (without this change), is that it tries to find a constant buffer from the chain of operations that results into the global memory address. If it fails to find a constant buffer, there is a second pass that will generate code to compare that address with all storage buffer base address and sizes that are accessed on the shader, and use the first one that is in bounds. The main problem with this approach comes when we need to do actual global memory emulation, for cases where the base pointer doesn't come from a buffer, but instead from some other unpredictable place. For those cases, we need to be able to tell whenever the base pointer comes from the constant buffer, or somewhere else.

The new approach gets rid of the second pass and does everything in a single pass. We can divide it in 3 cases:

  • Inline: This is the most simple case where we can generate the storage buffer operation inline, replacing the global memory operation. It can also use the offset directly, which means there's no need to load the base pointer and the constant buffer 0 (CB0) (or whichever constant buffer it loads the pointer from) can be eliminated too, which avoids the need to update the buffer.
  • Single target function: Pretty much like the above, but it calls a function instead of generating the code inline. This is for more complex cases, like store to byte or short types, which requires a compare and swap loop as we need to partially update the 32-bit integer value. GLSL already had helper function for those cases, the difference is that now we can generate those functions on the IR instead of manually writing the GLSL on the backend (and doing inline code generation on SPIR-V).
  • Multi target function: This is for cases where it can't find the base pointer location directly. In this mode it walks trough Phi node chains and tries to find all possible base pointer locations. It then generates a function that will to bounds checking and try finding the correct storage buffer, then do the operation on that buffer. The main difference from the old approach is that hte old method would check all possible locations that could have the base pointer, while this will only check the ones it can find. This has the advantage of generating more efficient code, and working with full global memory emulation (as we can tell apart cases where global memory elimination failed due to Phi nodes being in the way VS cases where games are passing and manipulating pointers on the shader directly). It has the disadvantage that the chances of it still not being able to find the correct location being way higher.

As part of this, I also made some changes to make global cases elimination work in more cases (since it not working is more catastrophic with those changes).

Some cases that are now handled:

  • The compiler will sometimes store the base pointer on a per-invocation shared memory location, and then load it later. I think it is just some temporary storage that it uses when it runs out of registers (but it doesn't make a lot of sense, it could also just load the value from the constant buffer again...). I found that on Zelda Breath of the Wild, so I added some basic tracking for shared memory store/load pairs.
  • The compiler generates XOR swap sequences to swap the value between two registers in some cases. This was getting in the way as global access elimination was not able to follow through the XOR sequence. So I made the optimizer turn those into regular copies which can be eliminated and don't get in the way. I found that one on Tokyo Mirage Sessions.
  • It was not able to match the base address + constant + constant pattern, because it was only expecting one constant. I simply changed the LDG and STG code generation to only ever add by one constant instead of two. An example that would generate that before, is LDG.64 with non-zero offset. The additional constant comes from the second element offset (+ 4). Found on Monster Hunter Rise.

There might be more cases that are not handled yet (I hope not).

Other notable changes:

  • The uniform buffer emulation transform that transforms global memory access with base address from the constant buffer region that NVN uses to store the address for constant buffers has been removed. This is not needed anymore since we now allow the base address to come from pretty much any constant buffer location. Downsides is that cases where the total amount of bindings exceeds 16 (which is usually the host limit) will not work. Another potential downside is that storage buffer access is slower than uniform buffer access. But it's unlikely to be a problem on Maxwell hardware as they only support 8 uniform buffers on compute anyway. I don't know if newer NVIDIA GPUs increased the amount of compute constant buffers.
  • Storage buffer alignment check was changed. Before the unaligned case was triggered if the address was not a multiple of 16, or if the host alignment was greater than 16. Now it just checks if the address is a multiple of the host alignment. This change allows the faster aligned access to also be used on Intel, where the alignment is 64.

What to test

As mentioned above, this change is more prone to breaking than the previous ones. So it needs general testing to ensure there's no regressions. When global access elimination fails, a message like Failed to find storage buffer for global memory operation will be printed in the log. All games where this message appears should be reported here. The only cases where this is expected to appear is on games that are currently broken due to global memory, like Nintendo Switch Sports.

gdkchan avatar May 17 '23 05:05 gdkchan

00:00:33.262 |W| GPU.AsyncTranslationThread.6 Gpu Log: Shader translator: Failed to find storage buffer for global memory operation "Store".

TOTK is another game that does it (They both are using the same engine so it makes sense i guess, probably Splatoon 3 too)

GamerzHell9137 avatar May 17 '23 12:05 GamerzHell9137

00:00:33.262 |W| GPU.AsyncTranslationThread.6 Gpu Log: Shader translator: Failed to find storage buffer for global memory operation "Store".

TOTK is another game that does it (They both are using the same engine so it makes sense i guess, probably Splatoon 3 too)

Where exactly does it do that?

gdkchan avatar May 17 '23 14:05 gdkchan

00:00:33.262 |W| GPU.AsyncTranslationThread.6 Gpu Log: Shader translator: Failed to find storage buffer for global memory operation "Store". TOTK is another game that does it (They both are using the same engine so it makes sense i guess, probably Splatoon 3 too)

Where exactly does it do that?

When booting the game, just before the main menu screen.

GamerzHell9137 avatar May 17 '23 14:05 GamerzHell9137

This might be happening while rebuilding the shaders for TOTK, can confirm for sure that it shows that when rebuilding shaders for Splatoon 3.

image

GamerzHell9137 avatar May 17 '23 14:05 GamerzHell9137

The next games while rebuilding shaders say

00:00:11.331 |W| GPU.AsyncTranslationThread.2 Gpu Log: Shader translator: Failed to find storage buffer for global memory operation "Store".

Animal Crossing TOTK Splatoon 3

GamerzHell9137 avatar May 17 '23 14:05 GamerzHell9137

The next games while rebuilding shaders say

00:00:11.331 |W| GPU.AsyncTranslationThread.2 Gpu Log: Shader translator: Failed to find storage buffer for global memory operation "Store".

Animal Crossing TOTK Splatoon 3

Can you try again? I updated it to catch a few more cases of global memory access on compute.

I tested it with 17000 shaders from many different games, and all remaining failures are more or less expected:

  • On a few compute shaders, the constant buffer from where the base pointter comes from doesn't have an offset that is a multiple of 16. This is not allowed right now, but I plan to change the requirement to 8 bytes once #4999 is merged.
  • In some cases there are 2 possible constant buffers, so there's no way to know from which one the base pointer comes from. It picks one where the offset is not aligned and fails, but regardless of what it picks, it could be wrong.
  • Weird global memory access on tessellation shaders, likely compiler generated. That one is from Mario Golf Super Rush.
  • Base pointer comes from vertex buffer. That one is from FIFA.
  • Base pointer comes from storage buffer. I think it's from FIFA too.
  • Nintendo Switch Sports indirect data clear shader.
  • Base pointer comes from constant buffer, but offset is not constant. This is from Re: Zero, this game uses Vulkan, it's likely something compiler generated.

gdkchan avatar May 17 '23 20:05 gdkchan

The next games while rebuilding shaders say

00:00:11.331 |W| GPU.AsyncTranslationThread.2 Gpu Log: Shader translator: Failed to find storage buffer for global memory operation "Store".

Animal Crossing TOTK Splatoon 3

Can you try again? I updated it to catch a few more cases of global memory access on compute.

I tested it with 17000 shaders from many different games, and all remaining failures are more or less expected:

* On a few compute shaders, the constant buffer from where the base pointter comes from doesn't have an offset that is a multiple of 16. This is not allowed right now, but I plan to change the requirement to 8 bytes once [GPU: Avoid using garbage size for non-cb0 storage buffers #4999](https://github.com/Ryujinx/Ryujinx/pull/4999) is merged.

* In some cases there are 2 possible constant buffers, so there's no way to know from which one the base pointer comes from. It picks one where the offset is not aligned and fails, but regardless of what it picks, it could be wrong.

* Weird global memory access on tessellation shaders, likely compiler generated. That one is from Mario Golf Super Rush.

* Base pointer comes from vertex buffer. That one is from FIFA.

* Base pointer comes from storage buffer. I think it's from FIFA too.

* Nintendo Switch Sports indirect data clear shader.

* Base pointer comes from constant buffer, but offset is not constant. This is from Re: Zero, this game uses Vulkan, it's likely something compiler generated.

Only Splatoon 3 is showing the message now.

image

GamerzHell9137 avatar May 17 '23 21:05 GamerzHell9137

is there an .exe or a build that i can test with these changes too ?

eXxe-elcior avatar May 17 '23 22:05 eXxe-elcior

Only Splatoon 3 is showing the message now.

I looked at the Splatoon 3 failures. It seems to be the same shader that Nintendo Switch Sports uses for clears. But it doesn't seems to be causing any issues on Splatoon 3. So the failures on this game seems to be expected too (at least for the shader that I found, which has 5 STG instructions with the base pointer from another storage buffer).

        /*0008*/         {         MOV32I R2, 0x1 ;
        /*0010*/                   ALD R0, a[0x2fc]         }
        /*0018*/                   ISCADD R0.CC, R0, c[0x0][0x110], 0x3 ;
        /*0028*/                   IADD.X R1, RZ, c[0x0][0x114] ;
        /*0030*/                   LDG.E.64 R0, [R0] ;
        /*0038*/                   STG.E [R0+0x4], R2 ;
        /*0048*/                   STG.E [R0], RZ ;
        /*0050*/                   STG.E [R0+0x8], RZ ;
        /*0058*/                   STG.E [R0+0xc], RZ ;
        /*0068*/                   STG.E [R0+0x10], RZ ;
        /*0070*/                   EXIT ;

is there an .exe or a build that i can test with these changes too ?

Just wait until the branch is rebased, then the bot will post the build links.

gdkchan avatar May 17 '23 23:05 gdkchan

Ready for review now. As I said on the main post, this is more likely to cause issues than the previous one, so will need some testing.

gdkchan avatar May 25 '23 20:05 gdkchan

Tested everything in my library. No issues :) Nvidia GPU used.

MutantAura avatar Jun 01 '23 20:06 MutantAura

I've been playing TOTK on an M1 Pro before and after this merge, and I feel like theres been more funny artifacts/stutters/visual hiccups with this merged into the main branch. Are there any specific logs that would be helpful that I can provide to help investigate?

tgcordell avatar Jun 05 '23 02:06 tgcordell

The only new log that this change produces is the "Failed to find storage buffer for global memory operation" that you can see on the comments above.

gdkchan avatar Jun 05 '23 02:06 gdkchan