video_core/shader: Optimize fragment shader by skipping passthrough TEV stages

Open jbm11208 opened this issue 6 months ago • 8 comments

This change adds a fast-path optimization in the fragment shader generator to detect and skip TEV stages that simply pass through their input unchanged. This reduces shader complexity and improves performance for common rendering cases where TEV stages are configured as passthrough.

The optimization checks for:

  • Replace operation for both color and alpha
  • Previous buffer as source
  • No color/alpha modifiers
  • Unity multipliers

This is a safe optimization as it preserves exact PICA behavior while reducing unnecessary shader instructions.
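The four checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `TevStage`, `Operation`, `Source`, `Modifier`, `IsPassthrough`, and `CountEmittedStages` are hypothetical stand-ins, and the real emulator types have more fields (for example, several sources and modifiers per stage). Which exact PICA source the PR compares against is not shown in this thread, so `Source::Previous` is only a placeholder for it.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-ins for the emulator's TEV stage types.
enum class Operation { Replace, Modulate, Add };
enum class Source { PrimaryColor, Texture0, Constant, Previous };
enum class Modifier { None, OneMinus };

struct TevStage {
    Operation color_op = Operation::Replace;
    Operation alpha_op = Operation::Replace;
    Source color_source = Source::Previous;
    Source alpha_source = Source::Previous;
    Modifier color_modifier = Modifier::None;
    Modifier alpha_modifier = Modifier::None;
    std::uint32_t color_multiplier = 1;
    std::uint32_t alpha_multiplier = 1;
};

// True only when the stage would copy its input unchanged:
// Replace op, "previous" source, no modifiers, unity multipliers.
constexpr bool IsPassthrough(const TevStage& s) {
    return s.color_op == Operation::Replace &&
           s.alpha_op == Operation::Replace &&
           s.color_source == Source::Previous &&
           s.alpha_source == Source::Previous &&
           s.color_modifier == Modifier::None &&
           s.alpha_modifier == Modifier::None &&
           s.color_multiplier == 1 &&
           s.alpha_multiplier == 1;
}

// Counts the stages a generator would still have to emit code for.
template <std::size_t N>
std::size_t CountEmittedStages(const std::array<TevStage, N>& stages) {
    std::size_t emitted = 0;
    for (const TevStage& stage : stages) {
        if (IsPassthrough(stage)) {
            continue; // fast path: nothing to generate for this stage
        }
        ++emitted;
    }
    return emitted;
}
```

A shader generator built this way would call `IsPassthrough` once per stage and emit combiner code only for the stages that fail the check. Any condition that is missed or too loose would change the rendered output, so each of the four checks has to hold for both color and alpha.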

This change also improves performance in games such as Luigi's Mansion: Dark Moon.

jbm11208 avatar May 10 '25 22:05 jbm11208

Looks like the change is breaking some visuals, notice the fountain: azahar_2QkakE2WBw

Here is how it should look: azahar_HvMI3XcEx3

PabloMK7 avatar May 11 '25 10:05 PabloMK7

Also the red bush

Algiuxs avatar May 11 '25 14:05 Algiuxs

I'll take a look and see if I can fix it

jbm11208 avatar May 11 '25 15:05 jbm11208

All fixed

jbm11208 avatar May 11 '25 21:05 jbm11208

Got a little carried away there, I think I'm going to stop here with optimizations

jbm11208 avatar May 12 '25 06:05 jbm11208

Everything should be working now, I went and tested my library of games and there are no longer any graphical issues.

jbm11208 avatar May 16 '25 17:05 jbm11208

By the way, I have played the LM2 intro side by side on 2121.1 and the msys2 artifact from this build, and the vulkan shader stutter, with the cache cleaned up beforehand, seems to be LONGER (a few ms) in this PR than on 2121.1.

PabloMK7 avatar May 19 '25 19:05 PabloMK7

Do you have at least a 3-3.5 ms render thread delay? You still need a delay, just a much smaller one. On my hardware, level D-1 of LM2 went from requiring a 9.5 ms delay on 2121.1 just to make the stuttering infrequent enough to be playable, to needing only a 3-4 ms delay to eliminate stuttering altogether.

jbm11208 avatar May 19 '25 20:05 jbm11208

As per the project readme, don't repeatedly merge master into your branch. A maintainer will do it if/when necessary.

OpenSauce04 avatar May 26 '25 16:05 OpenSauce04

I did that because the PR that was recently merged had modified files that may have an effect on this PR

jbm11208 avatar May 26 '25 16:05 jbm11208

Former core Citra devs typically used git rebase followed by a force push to the remote branch. I'm not sure if this approach is recommended here.

Dragios avatar Jun 01 '25 10:06 Dragios

We typically do that as well. When people do merges we typically roll back the merge commit and then do a rebase.

OpenSauce04 avatar Jun 01 '25 11:06 OpenSauce04

Will this get merged?

Algiuxs avatar Aug 28 '25 13:08 Algiuxs

It's still not clear if this PR gives any advantage.

PabloMK7 avatar Aug 28 '25 13:08 PabloMK7

Many of these changes should be separated out into their own PRs (Ex: the change over to fmt for string concat) to filter out any changes that negatively impact performance on some platforms. I have noted that sometimes a change will give some performance improvement on MSVC but hurt the MSYS2 build, or vice versa.

jbm11208 avatar Nov 11 '25 15:11 jbm11208

Did you benchmark on Linux too?

Algiuxs avatar Nov 12 '25 09:11 Algiuxs

My only usable Linux laptop isn't booting right now (BIOS chip corrupted, SOIC-8 test clip for external flashing broke), so I won't be able to compile and test on Linux until that is taken care of (my next newest Linux device is 17 years old, and I don't think it comes close to meeting any of the requirements to run Azahar).

Anyway, from what I've tested, the varying performance depends mostly on the compiler, so apart from differences between graphics drivers on Linux and Windows, performance should be similar when using the same compiler. I am curious whether something in this PR is causing MSYS2 performance to be so inconsistent or whether it is just compiler quirks. I generally compile with MSVC locally because it's faster (and I kept breaking my MSYS2 setup when playing around with it), so I can only really test the MSYS2 builds when I push my code and GitHub Actions compiles it for me.

jbm11208 avatar Nov 12 '25 18:11 jbm11208

Based on my experience, MSVC offers the fastest compilation times and the most convenient debugging process. However, in terms of runtime performance, binaries compiled with MSYS2 and Clang can achieve 5-15% higher efficiency on modern CPUs, particularly for computationally intensive applications. A notable example is the RPCS3 Windows build using Clang, which demonstrates significantly improved frame rates over the MSVC version on my Zen4 system.

crashGG avatar Nov 13 '25 05:11 crashGG

In my experience (at least with the code in this PR, and from the tests I did months ago now), MSYS2 seems to require a longer render thread delay than MSVC for Luigi's Mansion 2 / Dark Moon, but both builds can still use a much smaller delay than stock to eliminate any noticeable stutter. I don't remember whether the difference in render thread delay is also present outside this PR, though.

jbm11208 avatar Nov 13 '25 16:11 jbm11208