Lime-3DS-Emulator
video_core/shader: Optimize fragment shader by skipping passthrough TEV stages
This change adds a fast-path optimization in the fragment shader generator to detect and skip TEV stages that simply pass through their input unchanged. This reduces shader complexity and improves performance for common rendering cases where TEV stages are configured as passthrough.
The optimization checks for:
- Replace operation for both color and alpha
- Previous buffer as source
- No color/alpha modifiers
- Unity multipliers
This is a safe optimization as it preserves exact PICA behavior while reducing unnecessary shader instructions.
This change also improves performance in games like Luigi's Mansion: Dark Moon.
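As a rough illustration of the check described above, here is a minimal sketch. The types and names (`TevStage`, `IsPassthroughStage`, `CountEmittedStages`) are hypothetical, simplified stand-ins and not the emulator's actual definitions. It also assumes that "previous" refers to the output of the preceding stage, and it does not model any other per-stage state that a real shader generator may need to preserve.

```cpp
#include <array>
#include <cstddef>

// Hypothetical, simplified stand-ins for the TEV stage configuration.
// These are NOT the emulator's real types; they only illustrate the check.
enum class Source { Previous, PreviousBuffer, Texture0, Constant };
enum class ColorModifier { SourceColor, OneMinusSourceColor, SourceAlpha };
enum class AlphaModifier { SourceAlpha, OneMinusSourceAlpha };
enum class Operation { Replace, Modulate, Add };

struct TevStage {
    Source color_source1 = Source::Previous;
    Source alpha_source1 = Source::Previous;
    ColorModifier color_modifier1 = ColorModifier::SourceColor;
    AlphaModifier alpha_modifier1 = AlphaModifier::SourceAlpha;
    Operation color_op = Operation::Replace;
    Operation alpha_op = Operation::Replace;
    unsigned color_multiplier = 1;
    unsigned alpha_multiplier = 1;
};

// Returns true when the stage would output its input unchanged:
// Replace op on color and alpha, previous result as the source,
// no modifiers applied, and multipliers of exactly 1.
inline bool IsPassthroughStage(const TevStage& s) {
    return s.color_op == Operation::Replace &&
           s.alpha_op == Operation::Replace &&
           s.color_source1 == Source::Previous &&
           s.alpha_source1 == Source::Previous &&
           s.color_modifier1 == ColorModifier::SourceColor &&
           s.alpha_modifier1 == AlphaModifier::SourceAlpha &&
           s.color_multiplier == 1 && s.alpha_multiplier == 1;
}

// Counts the stages a generator would still have to emit code for
// if it skipped every passthrough stage.
inline std::size_t CountEmittedStages(const std::array<TevStage, 6>& stages) {
    std::size_t emitted = 0;
    for (const TevStage& s : stages) {
        if (!IsPassthroughStage(s)) {
            ++emitted;
        }
    }
    return emitted;
}
```

In this sketch every condition must hold at once; a stage that differs in any single field (for example, a multiplier of 2) is treated as a normal stage and is still emitted.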
Looks like the change is breaking some visuals, notice the fountain:
Here is how it should look:
Also the red bush
I'll take a look and see if I can fix it
All fixed
Got a little carried away there, I think I'm going to stop here with optimizations
Everything should be working now, I went and tested my library of games and there are no longer any graphical issues.
By the way, I played the LM2 intro side by side on 2121.1 and the MSYS2 artifact from this build, and the Vulkan shader stutter, with the cache cleared beforehand, seems to be LONGER (by a few ms) in this PR than on 2121.1.
Do you have at least a 3-3.5 ms render thread delay? You still need a delay, just much less. On my hardware, level D-1 of LM2 went from requiring a 9.5 ms delay on 2121.1 just to get to where the stuttering was infrequent enough to be playable, to only needing a 3-4 ms delay to eliminate stuttering altogether.
As per the project readme, don't repeatedly merge master into your branch. A maintainer will do it if/when necessary.
I did that because a recently merged PR modified files that may affect this PR.
Former core Citra devs typically used git rebase followed by a force push to the remote branch. I'm not sure if this approach is recommended here.
We typically do that as well. When people do merges we typically roll back the merge commit and then do a rebase.
Will this get merged?
It's still not clear if this PR gives any advantage.
Many of these changes should be separated out into their own PRs (e.g. the switch to fmt for string concatenation) to filter out any changes that negatively impact performance on some platforms. I have noticed that sometimes a change will give some performance improvement on MSVC but hurt the MSYS2 build, or vice versa.
Did you benchmark on Linux too?
My only usable Linux laptop isn't booting right now (BIOS chip corrupted, SOIC-8 test clip for external flashing broke), so I won't be able to compile and test on Linux until that is taken care of (my next newest Linux device is 17 years old, and I don't think it comes close to meeting any of the requirements to run Azahar).
Anyways, the varying performance is mostly based on the compiler from what I've tested, so besides differences between graphics drivers on Linux and Windows, the performance should be similar when using the same compiler. I am curious if it is something in this PR causing MSYS2 performance to be so inconsistent or if it is just weird compiler inconsistencies. I generally compile with MSVC locally because it's faster (and I kept breaking my MSYS2 setup when playing around with it), so I only really am able to test the MSYS2 builds whenever I push my code and GitHub Actions compiles it for me.
Based on my experience, MSVC offers the fastest compilation times and the most convenient debugging process. However, in terms of runtime performance, binaries compiled with MSYS2 and Clang can achieve 5-15% higher efficiency on modern CPUs, particularly for computationally intensive applications. A notable example is the RPCS3 Windows build using Clang, which demonstrates significantly improved frame rates over the MSVC version on my Zen4 system.
In my experience (at least with the code in this PR, and from the tests I did months ago now), MSYS2 seems to require a longer render thread delay than MSVC for Luigi's Mansion 2 / Dark Moon, but both versions are still able to use a much smaller delay than stock to eliminate any noticeable stutter. I don't remember if the difference in render thread delays is also present outside this PR, though.