VideoCommon: Add dynamic vertex loader for ubershaders to reduce pipeline count
Some fun things I've been working on; figured I'd post it here for other people to try out, even if it's not quite ready for merge.
Adds a vertex loader to vertex ubershaders which removes the need to do anything with the input assembler, reducing the number of unique pipelines needed. This makes the number of different pipelines that actually require recompiles small enough that we can compile them all ahead of time on most renderers. In particular:
- Metal: Completely eliminates pipeline compiles on all known GPUs
- DX12: Completely eliminates pipeline compiles on the two GPUs I tested (Radeon Pro 5600M with weird AMD Mac bootcamp drivers and a GT 750M with its ancient Kepler Mobile driver)
- Vulkan:
  - Completely eliminates pipeline compiles on MoltenVK on Intel, AMD, and Nvidia GPUs
  - Mostly eliminates pipeline compiles on MoltenVK on Apple GPUs
  - Mostly eliminates pipeline compiles on the two GPUs listed above on their Windows Vulkan drivers
The implementation puts the list of component offsets and strides into the vertex uniform, which the shader uses to figure out where each attribute should be loaded from in an SSBO.
I'm not super confident in my Vulkan code and even less in my DX12 code, so if someone could look those over especially closely that would be nice.
Before and After recordings
(Running with all pipeline and driver caches cleared, exclusive ubershaders, compile shaders before starting)

Before:
https://user-images.githubusercontent.com/3315070/175461508-3b744072-e4fd-491c-9aaa-00c83f6295b0.mp4
After (that one stutter is an EFB copy pipeline):
https://user-images.githubusercontent.com/3315070/175463003-7a272875-8bd2-47e1-8afd-e9057c6e028e.mp4
~~Note: Just waiting on #10747~~ All ready
Wouldn't it make more sense to use VK_EXT_vertex_input_dynamic_state on Vulkan when supported (Nvidia GPUs)?
In the beginning of the title screen in your after video there's a bunch of glitchiness on the left side. Is that a preexisting issue that just didn't show up in the before video?
I'm guessing that's an encoding issue based on the size of the pixels, so I just ignored it.
I think it would make more sense to make a PR without the Metal code.
> Wouldn't it make more sense to use VK_EXT_vertex_input_dynamic_state on Vulkan when supported (Nvidia GPUs)?
Probably; I admittedly wasn't really developing this for Vulkan. I'm happy to adjust the Vulkan implementation to check `g_ActiveConfig.backend_info.bSupportsDynamicVertexLoader`, though, so that the whole thing can be flipped off by writing `false` into the variable when someone does implement `VK_EXT_vertex_input_dynamic_state`. Maybe use a different flag to enable the additional pipelines in the "VideoCommon: Compile a few extra pipelines" commit as well? Or just always compile them, assuming it'll be really fast on OGL / DX11 since it doesn't actually involve any compilation there?
> I'm guessing that's an encoding issue based on the size of the pixels, so I just ignored it.
I had to compress those pretty hard to get under GitHub's 10 MB limit. I used the same (horribly low) quality settings on both, but clearly that doesn't mean x264 will produce similar results. I can confirm it doesn't happen when you actually play.
> I think it would make more sense to make a PR without the Metal code.
~~Just to confirm, you think that even though it would still include the spirv-cross PR code?~~ spirv-cross was merged, so I rebased off Metal.
@MayImilae Would be interesting to see if this fixes the problems with frame times you observed on Intel Macs.
I haven't looked over this in detail, but it feels like a somewhat strange approach. Dolphin already has a system to reformat vertices from the GameCube's format to one that works well for modern GPUs (`VertexLoaderX64`/`VertexLoaderARM64`), though as you've found it generates different formats based on the original one. IMO it would be better to modify that so that a single vertex with all components is generated. That would mean using 156-byte vertices in all cases (even if the input vertex is just a color and a single texture coordinate), though.
The majority of vertices I've seen are in the 20-28 byte range, so sending everything would be a 6-8x increase in vertex bytes just to slightly reduce the complexity of the ubershaders. Messing with stats reporting shows that it would be pretty common to see games hitting 20 MB/frame, or a bit over 1 GB/s of vertex data, this way.
Side note, if you were hoping to keep vertex loaders out of shaders in general, I was planning on implementing vertex shader line to triangle expansion for platforms without geometry shaders, which will also require a shader vertex loader.
Ah, line-width expansion without geometry shaders might actually be faster on Android; that's how slow geometry shaders are there.
It might be a bit simpler if there were, say, individual arrays for each component, some of which don't get populated, but I'm not sure if that would work or have any real benefit
Wouldn't that require binding like 15 different vertex buffers? Or were you thinking of some other way of accessing them all?
> It might be a bit simpler if there were, say, individual arrays for each component, some of which don't get populated, but I'm not sure if that would work or have any real benefit

> Wouldn't that require binding like 15 different vertex buffers? Or were you thinking of some other way of accessing them all?
I don't have an exact idea in mind, but it probably would require binding 15 different buffers, yeah. The main benefit would be reducing the amount of calculation needed to find the index into those buffers, since you would always know the correct buffer and stride for a given component. But I'm not sure whether that would actually be any faster (or whether it's possible to avoid syncing data for buffers that aren't relevant), and it would require a lot of other work.
Other than that things look mostly reasonable, though I don't really understand the D3D or Vulkan setup code. I don't see anything that looks obviously wrong, at least.