pcsx2 icon indicating copy to clipboard operation
pcsx2 copied to clipboard

Spin GPU during readbacks

Open TellowKrinkle opened this issue 2 years ago • 12 comments

Description of Changes

Adds an option to submit pointless work to GPUs while they're waiting for work after a readback completes (Currently supported on Vulkan and Metal)

The Vulkan implementation optionally uses VK_EXT_calibrated_timestamps. On GPUs that don't support async compute (Intel), this is almost required to not have really bad performance characteristics, but GPUs with async compute should be fairly OK either way. Regardless, we display a warning on the first readback if VK_EXT_calibrated_timestamps isn't supported. Most drivers support VK_EXT_calibrated_timestamps, but notably, MoltenVK does not. It just so happens that timestampPeriod reporting is also broken on MoltenVK+Intel, effectively disabling the feature anyways, so I doubt anyone will experience this slowdown.

Some notes on the Vulkan implementation: I wasn't sure whether to put the code in Common::Vulkan::Context or GSDeviceVK. It requires resources like buffers and pipelines, which were previously only managed in GSDeviceVK, but also requires access to timestamp data, which was previously only accessed in Common::Vulkan::Context. In the end I put it in Common::Vulkan::Context, but am open to other opinions on where this should go.

I'm also not counting spin time for the GPU usage counter, which means there may be disagreements between Task Manager and PCSX2 on GPU usage.

Finally, I wasn't sure whether to put the checkbox in renderer or advanced settings. For now it's in the renderer settings.

Rationale behind Changes

A lot of laptop GPUs go into power saving modes when they're not being fully utilized, and because we don't submit new work immediately after a readback, this often makes the GPU think it can save power when it really shouldn't. Seeing noticeable improvements on:

  • Radeon Pro 5600M on Windows
  • Intel UHD 630 on MacOS
Comparison Screenshots

Valkyrie Profile 2 @ 1x, Spin Off, Radeon Pro 5600M (Windows) image Valkyrie Profile 2 @ 1x, Spin On, Radeon Pro 5600M (Windows) image SSX Tricky Snowdream @ 4x, Spin Off, Radeon Pro 5600M (Windows) image SSX Tricky Snowdream @ 4x, Spin On, Radeon Pro 5600M (Windows) image Valkyrie Profile 2 @ 1x, Spin Off, Intel UHD 630 (macOS Metal) image Valkyrie Profile 2 @ 1x, Spin On, Intel UHD 630 (macOS Metal) image

Suggested Testing Steps

Test games that use readbacks. Not GSdumps, they don't have the same performance characteristics. Some steam deck numbers would be nice, since they've previously had a lot of issues with readbacks. Also Intel GPUs on Windows or Linux, I've only been able to test async-compute-supporting GPUs on the Vulkan implementation, which are much more forgiving in spin overruns. Power usage comparisons would be nice too, I expect high power usage but the question is how high.

Any game is fine, monitor the OSD Statistics for 1 RB (or more than 1). Some games I know benefit:

  • Valkyrie Profile 2, outside right after starting the game (see comparison screenshots above)
  • SSX Tricky, Snowdream
  • Gran Turismo 4, Night stages

TellowKrinkle avatar Sep 20 '22 04:09 TellowKrinkle

Helps with Justice League

Without: image

With: image

refractionpcsx2 avatar Sep 20 '22 07:09 refractionpcsx2

I'd like to test Gran Turismo 4 on this build, since the RB thing was a major issue for me in terms of playability. How do I get a copy?

Danke-Boi avatar Sep 20 '22 13:09 Danke-Boi

Click on checks, then Windows Builds then scroll down if you have to and download one of the builds.

F0bes avatar Sep 20 '22 13:09 F0bes

@Danke-Boi go to "checks" and then to your OS and download the build :)

Mrlinkwii avatar Sep 20 '22 13:09 Mrlinkwii

This failed to make any difference in GT4. I played the game in the latest nightly build, which does not include this PR, and I got 29-30 FPS when RB was 1, and I tried this PR, and enabled the spin option, and got 29-30FPS when RB was 1 (in the menu before the drag strip in Gran Turismo mode, under the speed and power section). Now, there is some pretty bad shaking, but the performance is the same, and aside from the shaking, there are no visual artifacts.

I will say, the performance in this game from 1 year ago until today is mind boggling. On my M1 MacBook Air, with the 7core GPU, and 8GB of RAM, I was barely capable of 1080p upscaling. I can now comfortably do so, with no frame drops besides when RB is 1, which is why this is of such interest to me.

Danke-Boi avatar Sep 20 '22 23:09 Danke-Boi

@Danke-Boi did you make sure to tick the new option and use Vulkan?

refractionpcsx2 avatar Sep 20 '22 23:09 refractionpcsx2

Helps with Gran Turismo 4. I haven't checked the notorious George V Paris track yet, but all other night tracks shows noticeable improvements.

Nvidia 930M @ 1x, Arch Linux (Vulkan)

Without: spin gpu off

With: spin gpu on

carothersmarx avatar Sep 21 '22 00:09 carothersmarx

Hello! I'm sorry to ask, but is this change also supposed to fix performance issues in non-mobile GPUs? I ask because I have a GTX 1660 Ti for desktop and I was having performance issues with the Dog's Life game. There were some frame drops both in the title screen and in the save file selection screen and this PR seems to have fixed that. I've recorded a short video showing the difference, here: https://drive.google.com/file/d/1-0aF_n95XHdaPqmROJm-Vr4yoxJ-_eqe/view?usp=sharing

If so, maybe the tool-tip for that could be rewritten to possibly state the advantages on any GPUs?

Also, I've noticed this PR has MTVU enabled as default and this does not occur with my main build which I've always used and updated with the internal PCSX2 updater. The issue is that MTVU on does make a difference in this game to produce these results. So perhaps you could address that in an update by turning MTVU on as default for there may be others using the updater for quite some time now and they may also be unaware that this option is now recommended by default in newer builds. Or you could show an on-screen message as you did recently with the Texture preloading option (check refraction's screenshots). By the way, I had to figure out which option I had to use for Texture Preloading by myself because the tool-tip is not so clear about the difference.

Thank you for your time and work.

dezraj avatar Sep 21 '22 06:09 dezraj

Also, I've noticed this PR has MTVU enabled as default and this does not occur with my main build

MTVU is on by default it's only disabled if you have an older config that didn't have it enabled.

JordanTheToaster avatar Sep 21 '22 10:09 JordanTheToaster

@Danke-Boi did you make sure to tick the new option and use Vulkan?

@refractionpcsx2 Yes, but I used Metal, given the

(Currently supported on Vulkan and Metal)

No change after checking the box on Metal, I didn't try Vulkan, I can test later tonight, but I doubt it will do anything.

I KNOW WHY IT WON'T WORK ON M1 Apple has a terrible boosting algorithm, masked by the efficiency of the chip. So it doesn't go into power saving mode while waiting for stuff to happen. Even though it is a mobile chip. This means that this won't benefit Apple Silicon devices. (Until Apple optimizes boosting, although I doubt they want to, because of their history with Intel Turbo Boost). Especially considering that the M1 runs at full power on and off of the charger, unlike most x86 based computers.

Danke-Boi avatar Sep 21 '22 12:09 Danke-Boi

@Danke-Boi if that's true, then it possibly won't affect you, as this is solely to stop GPU's clocking down and idling when doing the download to avoid the ramp up and thread spin up when waiting at the end of a read back.

It seems to work on Windows on my desktop computer (in the case I showed above), but doesn't affect everything, some games saw zero change.

refractionpcsx2 avatar Sep 21 '22 13:09 refractionpcsx2

Also, I've noticed this PR has MTVU enabled as default and this does not occur with my main build

MTVU is on by default it's only disabled if you have an older config that didn't have it enabled.

Thank you for your response.

Yes, I came to realize that thanks to this testing. If you don't have MTVU on in the current nightly build the performance in Dog's Life will be much worse, which was my case. I had to compare every option in this PR with my main build to finally find the difference in the MTVU setting, which is now on by default.

As I mentioned before, you have also changed the default setting for "Texture preloading" but here a message is shown to hint the user about that.

Now I'm a little concerned about the updater trustfulness in keeping the program as 100% improved as the developers intend. Or maybe the MTVU silent change was an absolute exception?

Anyway, I just don't want to have my current configuration outdated again using the updater.

Thank you for your time.

dezraj avatar Sep 21 '22 14:09 dezraj

I KNOW WHY IT WON'T WORK ON M1

Last time I borrowed someone's M1 to test, I got a 50% improvement in GT4 from spinning, though that was with a different implementation (still available on my MacOS branch if you want to test)

Please use asitop / powermetrics to verify that GPU usage is high and clocks are high.

TellowKrinkle avatar Sep 21 '22 17:09 TellowKrinkle

Screenshot 2022-09-22 at 10 05 52 AM This is the power draw when RB = 0 and spin is on, note that it is at 85% utilization and is drawing 1000 mW more Screenshot 2022-09-22 at 10 05 29 AM power draw when RB = 1 and spin is on. These images aren't backwards, the GPU is drawing less power, but is at 98% utilization. I will test that other build here shortly @tellowkrinkle

Danke-Boi avatar Sep 22 '22 14:09 Danke-Boi

I KNOW WHY IT WON'T WORK ON M1

Last time I borrowed someone's M1 to test, I got a 50% improvement in GT4 from spinning, though that was with a different implementation (still available on my MacOS branch if you want to test)

Please use asitop / powermetrics to verify that GPU usage is high and clocks are high.

Here is the MacOS build, settings the same, including the spin option being enabled, same behaviour. Screenshot 2022-09-22 at 10 14 07 AM Here is when RB = 1 ^ Screenshot 2022-09-22 at 10 14 33 AM RB = 0 ^ I have the 7 core GPU, so it is possible I am GPU limited by rendering at 1080p, although I doubt it. It could also be a RAM issue, I have 8GB of RAM, not 16. It isn't a throttling issue, as you saw.

Danke-Boi avatar Sep 22 '22 14:09 Danke-Boi

Do you have Xcode installed? If so, can you open Instruments and record a Metal System Trace while RB is 1 and spin is enabled?

Also, do try lower resolution. Your GPU is at 65% to hold 60fps with RB 0, and already at high power, and RBs force the GPU and CPU to share time, so when RB 1 happens, the CPU would have to finish everything in just 35%, which probably isn't happening regardless of how much spinning happens.

TellowKrinkle avatar Sep 22 '22 16:09 TellowKrinkle

Do you have Xcode installed? If so, can you open Instruments and record a Metal System Trace while RB is 1 and spin is enabled? Also, do try lower resolution. Your GPU is at 65% to hold 60fps with RB 0, and already at high power, and RBs force the GPU and CPU to share time, so when RB 1 happens, the CPU would have to finish everything in just 35%, which probably isn't happening regardless of how much spinning happens.

How do I do that exactly? I have it downloaded, what do I do to get to Instruments?

I won't let me edit that comment, but I got it. I am uploading it to Google Drive, since GitHub doesn't like the file https://drive.google.com/file/d/1rew9Hdi1tckw1nqUKWiTqbPWh45tiYg0/view?usp=sharing

Danke-Boi avatar Sep 22 '22 17:09 Danke-Boi

Also, do try lower resolution. Your GPU is at 65% to hold 60fps with RB 0, and already at high power, and RBs force the GPU and CPU to share time, so when RB 1 happens, the CPU would have to finish everything in just 35%, which probably isn't happening regardless of how much spinning happens.

Didn't work, was actually slower than master at native resolution. Master got 35-37 FPS, PR got 30-32. All settings were the same, I double checked that the "spin" option was checked.

Danke-Boi avatar Sep 22 '22 17:09 Danke-Boi

I won't let me edit that comment, but I got it. I am uploading it to Google Drive, since GitHub doesn't like the file https://drive.google.com/file/d/1rew9Hdi1tckw1nqUKWiTqbPWh45tiYg0/view?usp=sharing

Okay some of the spins are going way too long and pushing back the actual draw execution image

This probably means that our timing code isn't working properly, I'm going to add a bunch of logging of times, please run from the terminal (PCSX2.app/Contents/MacOS/PCSX2) or with the log viewer open, and post some of the spam It should look something like this:

Draw completed in 1257µs (21397237993-21397239251), Waited 590µs (21397237402-21397237993)
Spin completed in 8475µs (21397239251-21397247727), Waited 1702µs (21397237549-21397239251)
Draw completed in 1710µs (21397249330-21397251040), Waited 292µs (21397249037-21397249330)
Spin completed in 2180µs (21397251041-21397253221), Waited 1973µs (21397249067-21397251041)
Draw completed in 9319µs (21397255348-21397264668), Waited 182µs (21397255166-21397255348)
Spin completed in 8375µs (21397264668-21397273044), Waited 9469µs (21397255198-21397264668)
Draw completed in 1711µs (21397280988-21397282700), Waited 6936µs (21397274051-21397280988)
Spin completed in 3316µs (21397282700-21397286016), Waited 8621µs (21397274079-21397282700)
Draw completed in 1335µs (21397286017-21397287352), Waited 5275µs (21397280741-21397286017)
Spin completed in 8308µs (21397287352-21397295660), Waited 6543µs (21397280809-21397287352)
Draw completed in 1719µs (21397297103-21397298822), Waited 304µs (21397296798-21397297103)

TellowKrinkle avatar Sep 23 '22 00:09 TellowKrinkle

I won't let me edit that comment, but I got it. I am uploading it to Google Drive, since GitHub doesn't like the file https://drive.google.com/file/d/1rew9Hdi1tckw1nqUKWiTqbPWh45tiYg0/view?usp=sharing

Okay some of the spins are going way too long and pushing back the actual draw execution

This probably means that our timing code isn't working properly, I'm going to add a bunch of logging of times, please run from the terminal (PCSX2.app/Contents/MacOS/PCSX2) or with the log viewer open, and post some of the spam It should look something like this:

Draw completed in 1257µs (21397237993-21397239251), Waited 590µs (21397237402-21397237993)
Spin completed in 8475µs (21397239251-21397247727), Waited 1702µs (21397237549-21397239251)
Draw completed in 1710µs (21397249330-21397251040), Waited 292µs (21397249037-21397249330)
Spin completed in 2180µs (21397251041-21397253221), Waited 1973µs (21397249067-21397251041)
Draw completed in 9319µs (21397255348-21397264668), Waited 182µs (21397255166-21397255348)
Spin completed in 8375µs (21397264668-21397273044), Waited 9469µs (21397255198-21397264668)
Draw completed in 1711µs (21397280988-21397282700), Waited 6936µs (21397274051-21397280988)
Spin completed in 3316µs (21397282700-21397286016), Waited 8621µs (21397274079-21397282700)
Draw completed in 1335µs (21397286017-21397287352), Waited 5275µs (21397280741-21397286017)
Spin completed in 8308µs (21397287352-21397295660), Waited 6543µs (21397280809-21397287352)
Draw completed in 1719µs (21397297103-21397298822), Waited 304µs (21397296798-21397297103)

Draw completed in 1164µs (148088513862-148088515026), Waited 6µs (148088513855-148088513862) Spin completed in 15993µs (148088515061-148088531054), waited 1145µs (148088513916-148088515061), counters 15993µs (148088515061500-148088531054791) Draw completed in 14527µs (148088525168-148088539696), Waited 18446744073699387µs (148088535332-148088525168) Spin completed in 3061µs (148088539721-148088542783), waited 4315µs (148088535406-148088539721), counters 3061µs (148088539721875-148088542783000) Draw completed in 1073µs (148088544331-148088545404), Waited 153µs (148088544177-148088544331) Spin completed in 16042µs (148088545438-148088561480), waited 1196µs (148088544241-148088545438), counters 16042µs (148088545438291-148088561480291) Draw completed in 13716µs (148088556773-148088570490), Waited 18446744073700736µs (148088565588-148088556773) Spin completed in 3294µs (148088570513-148088573807), waited 4855µs (148088565658-148088570513), counters 3294µs (148088570513458-148088573807583) Draw completed in 1125µs (148088574872-148088575997), Waited 11µs (148088574860-148088574872) Spin completed in 15991µs (148088576032-148088592024), waited 1115µs (148088574917-148088576032), counters 15991µs (148088576032625-148088592024041) Draw completed in 13688µs (148088586769-148088600458), Waited 18446744073699218µs (148088597102-148088586769) Spin completed in 3726µs (148088600485-148088604211), waited 3291µs (148088597193-148088600485), counters 3726µs (148088600485083-148088604211416) Draw completed in 1135µs (148088604929-148088606065), Waited 18446744073709543µs (148088604937-148088604929) Spin completed in 15993µs (148088606099-148088622092), waited 1077µs (148088605021-148088606099), counters 15993µs (148088606099166-148088622092166) Draw completed in 14671µs (148088616716-148088631388), Waited 18446744073698558µs (148088627709-148088616716) Spin completed in 3548µs (148088631415-148088634964), waited 3632µs (148088627782-148088631415), counters 3548µs (148088631415208-148088634964041) Draw completed in 1231µs (148088636003-148088637234), Waited 3µs (148088636000-148088636003) Spin completed in 16032µs (148088637270-148088653302), waited 1188µs (148088636081-148088637270), counters 16032µs (148088637270750-148088653302791) Draw completed in 13790µs (148088648164-148088661955), Waited 18446744073700164µs (148088657551-148088648164) Spin completed in 3850µs (148088661979-148088665830), waited 4359µs (148088657620-148088661979), counters 3850µs (148088661979791-148088665830166) Draw completed in 1146µs (148088666180-148088667326), Waited 5µs (148088666174-148088666180) Spin completed in 16051µs (148088667356-148088683407), waited 1123µs (148088666233-148088667356), counters 16051µs (148088667356625-148088683407958) Draw completed in 14531µs (148088678229-148088692761), Waited 18446744073695284µs (148088692496-148088678229) Spin completed in 3941µs (148088692786-148088696727), waited 214µs (148088692571-148088692786), counters 3941µs (148088692786500-148088696727541) Draw completed in 1139µs (148088697417-148088698556), Waited 1µs (148088697415-148088697417) Spin completed in 16034µs (148088698588-148088714623), waited 1109µs (148088697479-148088698588), counters 16034µs (148088698588750-148088714623000) Draw completed in 13104µs (148088709391-148088722495), Waited 18446744073700440µs (148088718502-148088709391) Spin completed in 3459µs (148088722517-148088725976), waited 3951µs (148088718566-148088722517), counters 3459µs (148088722517083-148088725976208) Draw completed in 1134µs (148088726874-148088728009), Waited 18446744073709542µs (148088726883-148088726874) Spin completed in 15986µs (148088728044-148088744030), waited 1069µs (148088726974-148088728044), counters 15986µs (148088728044166-148088744030291) Draw completed in 12972µs (148088738970-148088751943), Waited 18446744073700909µs (148088747612-148088738970) Spin completed in 4433µs (148088751963-148088756397), waited 4279µs (148088747684-148088751963), counters 4433µs (148088751963625-148088756397166) Draw completed in 1232µs (148088756139-148088757372), Waited 10µs (148088756129-148088756139) Spin completed in 15953µs (148088757405-148088773358), waited 1209µs (148088756196-148088757405), counters 15953µs (148088757405458-148088773358708)

I bolded the super long pauses, but surprisingly, very consistent.

Danke-Boi avatar Sep 26 '22 12:09 Danke-Boi

I did the math on the above, and that number ends up being 584.9424 years, so I'm assuming that's not accurate.

Danke-Boi avatar Sep 26 '22 18:09 Danke-Boi

I did the math on the above, and that number ends up being 584.9424 years, so I'm assuming that's not accurate.

Indeed, it's an unsigned 64-bit integer underflow (divided by 1000), because I didn't expect kernelEndTime to come after GPUStartTime. Conveniently, this is also what was causing the spin to run too long. On my AMD GPU, all the kernel processing happens at once, usually finishing before any GPU work starts. It appears it's different on M1. I switched the spin finish target from kernelEndTime to kernelStartTime, so hopefully that should fix everything. (GPUStartTime can cause issues on GPUs where GPU work can't run in parallel with spinning, like Intel GPUs)

image

Test now and see if you see improvements.

TellowKrinkle avatar Sep 26 '22 23:09 TellowKrinkle

I did the math on the above, and that number ends up being 584.9424 years, so I'm assuming that's not accurate.

Indeed, it's an unsigned 64-bit integer underflow (divided by 1000), because I didn't expect kernelEndTime to come after GPUStartTime. Conveniently, this is also what was causing the spin to run too long. On my AMD GPU, all the kernel processing happens at once, usually finishing before any GPU work starts. It appears it's different on M1. I switched the spin finish target from kernelEndTime to kernelStartTime, so hopefully that should fix everything. (GPUStartTime can cause issues on GPUs where GPU work can't run in parallel with spinning, like Intel GPUs)

Test now and see if you see improvements.

Interesting behaviour now, it starts at around 38 FPS when RB first hits 1, and slowly(over the course of a roughly 3 seconds) declines back to the same behaviour as before, instead of immediately dropping to 30/31.

Do you want another System Trace to see the behaviour in detail?

Danke-Boi avatar Sep 27 '22 12:09 Danke-Boi

Probably means we're not spinning enough now

Yes, another system trace would be helpful

TellowKrinkle avatar Sep 27 '22 17:09 TellowKrinkle

Probably means we're not spinning enough now

Yes, another system trace would be helpful

I can already see the not spinning enough lol, it is uploading to drive ATM, I will edit this with the link. Also, I started the trace, then switched over to the game at about 10 secs because it started recording earlier that I thought. https://drive.google.com/drive/folders/14-ADa2o1rqCrE54vlKYN0Sa8phbrbchk?usp=sharing Glad I found something to test this so quickly, otherwise I wouldn't be able to help, since spending 16 secs on a game is fine at school, but 3 mins+ of actively playing it...

Danke-Boi avatar Sep 27 '22 17:09 Danke-Boi

Could you run again with powermetrics/asitop? (Watch both CPU and GPU clocks when it's running fast vs slow.) The trace looks like the GPU is spinning as expected and the CPU time is what gets longer and longer. I can add CPU spinning and see if that helps.

TellowKrinkle avatar Sep 27 '22 18:09 TellowKrinkle

@Danke-Boi I added CPU spinning so test that and see if it helps (with both CPU and GPU spinning enabled, not just CPU spinning)

TellowKrinkle avatar Sep 28 '22 07:09 TellowKrinkle

Sorry, I am working on getting asitop installed, but it isn't working. I just get "legacy errors" even though it was updated for Monterey. I will relearn powermetrics...

Danke-Boi avatar Sep 29 '22 12:09 Danke-Boi

@Danke-Boi I added CPU spinning so test that and see if it helps (with both CPU and GPU spinning enabled, not just CPU spinning)

Erm, it made things somewhat better, I am up to 36 at the lowest, 39 avg when RB =1, but now I get frame drops elsewhere, when RB != 1, that I have never had before. Screenshot 2022-09-29 at 9 09 47 AM Screenshot 2022-09-29 at 9 09 41 AM Screenshot 2022-09-27 at 8 28 32 AM

GPU power is at half or less the power it was before, so that's cool I guess. I'm assuming you want another system trace, so I'm gonna do that rn.

Danke-Boi avatar Sep 29 '22 13:09 Danke-Boi

@Danke-Boi I added CPU spinning so test that and see if it helps (with both CPU and GPU spinning enabled, not just CPU spinning)

Trace link here https://drive.google.com/file/d/15HC3m20kKAFqfNJ6_MlSw2zvNxzfMqKf/view?usp=sharing

Danke-Boi avatar Sep 29 '22 13:09 Danke-Boi