dolphin icon indicating copy to clipboard operation
dolphin copied to clipboard

VideoCommon: Add dynamic vertex loader for ubershaders to reduce pipeline count

Open TellowKrinkle opened this issue 2 years ago • 11 comments

Some fun things I've been working on, figured I'd post it here for other people to try out, even if it's not quite ready for merge

Adds a vertex loader to vertex ubershaders which removes the need to do anything with the input assembler, reducing the number of unique pipelines needed. This makes the number of different pipelines that actually require recompiles small enough that we can compile them all ahead of time on most renderers. In particular:

  • Metal: Completely eliminates pipeline compiles on all known GPUs
  • DX12: Completely eliminates pipeline compiles on the two GPUs I tested (Radeon Pro 5600M with weird AMD Mac bootcamp drivers and a GT 750M with its ancient Kepler Mobile driver)
  • Vulkan:
    • Completely eliminates pipeline compiles on MoltenVK on Intel, AMD, and Nvidia GPUs
    • Mostly eliminates pipeline compiles on MoltenVK on Apple GPUs
    • Mostly eliminates pipeline compiles on the two GPUs listed above on their Windows Vulkan drivers

The implementation puts the list of offsets and strides into the vertex uniform, which the shader uses to figure out where it should load from in an SSBO.

I'm not super confident in my Vulkan code and even less in my DX12 code, so if someone could look those over especially closely that would be nice.

Before and After recordings (Running with all pipeline and driver caches cleared, exclusive ubershaders, compile shaders before starting)

Before:

https://user-images.githubusercontent.com/3315070/175461508-3b744072-e4fd-491c-9aaa-00c83f6295b0.mp4

After (that one stutter is an EFB copy pipeline):

https://user-images.githubusercontent.com/3315070/175463003-7a272875-8bd2-47e1-8afd-e9057c6e028e.mp4

~~Note: Just waiting on #10747~~All ready

TellowKrinkle avatar Jun 24 '22 06:06 TellowKrinkle

Wouldn't it make more sense to use VK_EXT_vertex_input_dynamic_state on Vulkan when supported (Nvidia GPUs)?

K0bin avatar Jun 29 '22 10:06 K0bin

In the beginning of the title screen in your after video there's a bunch of glitchiness on the left side. Is that a preexisting issue that just didn't show up in the before video?

Dentomologist avatar Jun 29 '22 17:06 Dentomologist

I'm guessing that's an encoding issue based on the size of the pixels, so I just ignored it.

JMC47 avatar Jun 29 '22 18:06 JMC47

I think it would make more sense to make a PR without the Metal code.

Rumi-Larry avatar Jun 29 '22 19:06 Rumi-Larry

Wouldn't it make more sense to use VK_EXT_vertex_input_dynamic_state on Vulkan when supported (Nvidia GPUs)?

Probably, I admittedly wasn't really developing this for Vulkan. I'm happy to adjust the Vulkan implementation to check g_ActiveConfig.backend_info.bSupportsDynamicVertexLoader so that the whole thing can be flipped off by writing false into the variable when someone does implement VK_EXT_vertex_input_dynamic_state though. Maybe use a different flag to enable the additional pipelines in the VideoCommon: Compile a few extra pipelines commit as well? Or just always compile them, assuming it'll be really fast on OGL / DX11 since it doesn't actually involve any compilation there?

I'm guessing that's an encoding issue based on the size of the pixels, so I just ignored it.

I had to compress those pretty hard to get under GitHub's 10mb limit. Used the same (horribly low) quality settings on both but clearly that doesn't mean x264 will produce similar results. Can confirm that doesn't happen when you actually play.

I think it would make more sense to make a PR without the Metal code.

~~Just to confirm, you think that even though it would still include the spirv-cross PR code?~~ spirv-cross was merged, so I rebased off Metal

TellowKrinkle avatar Jun 30 '22 03:06 TellowKrinkle

@MayImilae Would be interesting to see if this fixes the problems with frame times you observed on Intel Macs.

OatmealDome avatar Jul 13 '22 20:07 OatmealDome

I haven't looked over this in detail, but it feels like a somewhat strange approach. Dolphin already has a system to reformat vertices from the GameCube's format to one that works well for modern GPUs (VertexLoaderX64/VertexLoaderARM64), though as you've found it generates different formats based on the original one. IMO it would be better to modify that so that a single vertex with all components is generated. That would mean using 156-byte vertices in all cases (even if the input vertex is just a color and a single texture coordinate), though.

Pokechu22 avatar Jul 17 '22 05:07 Pokechu22

The majority of vertices I've seen are in the 20-28 byte range, which would mean sending everything would be a 6-8x increase in vertex bytes, just to slightly reduce the complexity of ubershaders. Messing with stats reporting shows that it would be pretty common to see games hitting 20MB/frame or a bit over 1GB/s of vertices this way.

Side note, if you were hoping to keep vertex loaders out of shaders in general, I was planning on implementing vertex shader line to triangle expansion for platforms without geometry shaders, which will also require a shader vertex loader.

TellowKrinkle avatar Jul 17 '22 08:07 TellowKrinkle

Ah, Line-width without geometry shaders might actually be faster on Android. That's how slow they are.

JMC47 avatar Jul 17 '22 08:07 JMC47

It might be a bit simpler if there were, say, individual arrays for each component, some of which don't get populated, but I'm not sure if that would work or have any real benefit

Wouldn't that require binding like 15 different vertex buffers? Or were you thinking of some other way of accessing them all?

TellowKrinkle avatar Aug 03 '22 06:08 TellowKrinkle

It might be a bit simpler if there were, say, individual arrays for each component, some of which don't get populated, but I'm not sure if that would work or have any real benefit

Wouldn't that require binding like 15 different vertex buffers? Or were you thinking of some other way of accessing them all?

I don't have an exact idea in mind, but it probably would require binding 15 different buffers, yeah. The main benefit would be reducing the amount of calculation that needs to be done to find the index into those buffers since you would always know the correct buffer and stride for given component. But I'm not sure if that would actually be any faster (or if it's possible to avoid syncing data for buffers that aren't relevant), and it would require lots of other work.

Pokechu22 avatar Aug 05 '22 03:08 Pokechu22

Other than that things look mostly reasonable, though I don't really understand the D3D or Vulkan setup code. I don't see anything that looks obviously wrong, at least.

Pokechu22 avatar Sep 18 '22 04:09 Pokechu22