RetroArch
Savestate save or load with threaded tasks off slows down in proportion to the console's savestate size.
Description
To elaborate on the title: saving and loading states gets slower the more memory the savestate has, and nonlinearly so. While a 'small' savestate, from the NES for instance, might take two seconds, a PS1 state takes 50 seconds, a PS2 state several minutes, etc.
This can easily be confirmed by turning off threaded tasks in the advanced settings -> UI menu and then attempting to save a state in any PS1 game:
It can also be checked by installing bpftrace (`sudo apt install bpftrace`) and using it to record the number of times the savestate dispatch functions are called and the interval between calls. The same delay occurs during load state, so these interruptions severely disrupt I/O in both cases.
(these RetroArch paths are where I keep my master build)
sudo bpftrace -e 'uprobe:/home/i3/Documents/Projects/RetroArch/retroarch:task_push_save_state { @start = nsecs; } uprobe:/home/i3/Documents/Projects/RetroArch/retroarch:task_save_handler { @probe = count(); } uprobe:/home/i3/Documents/Projects/RetroArch/retroarch:task_save_handler_finished { printf("elapsed ms %d:\n", (nsecs - @start) / 1000000); exit(); } '
Expected behavior
I expected either that the number of calls to task_save_handler per slider update would be fixed (i.e. SAVE_STATE_CHUNK not hardcoded, but scaled to the platform's savestate size), or that this whole 'segment the savestate to display a slider' business simply would not happen when threaded tasks are turned off: just a final 'saved state' message, or a persistent one that is replaced by 'saved' at the end, so two calls total.
Actual behavior
The nonsense in the video occurs, on the platforms that can afford it the least; and if you'd like RetroArch not to have an extra UI thread used for basically nothing, you suffer this on savestates and probably downloads too.
Steps to reproduce the bug
- Turn off threaded tasks
- Load a game from any 16-bit-or-later platform; later platforms are much, much worse
- Save or load a savestate
Bisect Results
This always happened
Version/Commit
You can find this information under Information/System Information
- RetroArch:
RetroArch: Frontend for libretro -- v1.10.3 -- 3abd414656 --
Environment information
- OS: Ubuntu 22.04.1
- Compiler:
GCC (11.2.0) 64-bit Built: Sep 3 2022
This is a duplicate of a bug that was closed inexplicably, since this problem is very much neither fixed nor irrelevant.
I also suspect that even with threaded tasks this might be delaying saves or loads noticeably, by forcing the I/O to be written or read in minuscule amounts of bytes at a time. A fixed, minuscule amount, which of course makes large savestates slower to write or read too, even with threads.
This is actually slightly surprising, in that I expected that if threaded tasks were turned off and this delay and absurd number of callbacks happened, a savestate would introduce a lot of delays into the emulation (i.e. tank the framerate).
That doesn't appear to happen, which makes me think the savestate is already being written on a thread other than the core's main thread. The alternative, that this number of callbacks is actually necessary to avoid pausing the emulation without threaded tasks, is too horrible to contemplate. But I think it shouldn't be like this, since the RA UI and the core emulation should be different threads in the end, even without 'threaded tasks' for the UI. Nothing else makes sense; otherwise just displaying the framerate would have performance implications.
So I might have to refresh my memory, but I recall that when we did the task queue implementation, my thought was that if the task queue was not available on a platform because of no threading, these non-threaded 'tasks' should not 'block' or interrupt the UI, or even slow down the gameplay much. Hence why a somewhat dodgy heuristic was chosen to split the workload into 'chunks' based on what would fit into a frame without ruining the framerate too much.
There might be better ways of going about this, and it might be that the heuristic used here was total amateur level and we might have to do something based on disk I/O performance and some other heuristics, but at least the thought behind it is/was decent. If you have any suggestions on how to improve it while still abiding by this design ideal, I'm all ears.
I believe even image decoding works on a similar principle, although I believe there we need some improvements made; PNG decoding tends to take way too long right now in a weird way (since it's not bottlenecked by disk I/O or even CPU as far as I've been able to tell).
TLDR - we need more intelligent ways of deciding chunk sizes for non-threaded tasks.
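One concrete shape such a heuristic could take, as a rough sketch in C (every name here is hypothetical, not the existing task code), is sizing the chunk from a measured write throughput and a per-frame time budget:

```c
#include <stddef.h>

/* Hypothetical helper: pick a chunk size that should fit in the slice of a
 * frame we're willing to spend on I/O, given a measured (or assumed) write
 * throughput. None of these names exist in RetroArch; this only illustrates
 * the "fit the chunk into a frame" idea described above. */
static size_t chunk_for_frame_budget(double bytes_per_sec, /* probe result */
      double budget_sec,                                   /* e.g. ~4 ms   */
      size_t state_size)
{
   size_t chunk = (size_t)(bytes_per_sec * budget_sec);

   if (chunk < 4096)        /* never shrink below the current tiny chunk */
      chunk = 4096;
   if (chunk > state_size)  /* no point chunking past the whole state    */
      chunk = state_size;
   return chunk;
}
```

With a disk doing 30 MB/s and a 4 ms budget that works out to roughly 120 KB per call; anything faster quickly approaches writing the state in a handful of calls.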
Worst case possible.
Well, let's just say I don't agree with those limitations. If I were on a platform that had no threads I wouldn't want my load states/savestates to literally take minutes at a time (and this is on a recent computer).
So I'd abandon that and pause emulation during savestate save/load in a non-threaded environment, and always use a larger SAVE_STATE_CHUNK (so that it gives a fixed number of iterations, say one every 0.5% of the total), or use it just for the threaded case if you want to be 'best efficiency', given that you're going to pause anyway.
That way, saves 'still are fast', they're slightly faster on threaded computers (because they write more per iteration), there are 200 iterations so the slider is minimally smooth, and the people who need the speed the most don't have to wait anxiously forever for a 16 MB savestate to save or load (2817 iterations according to bpftrace - which doesn't seem like nearly enough to explain the pathological slowdown, but it seems interrupting writes has more effect than expected).
And people on limited platforms aren't waiting much, much longer than necessary for the savestate to save so they can load it, which is a very common pattern.
That's my opinion. You could make it a special 'saving' progress bar that shows it's paused (or an overlay over the screen) just for the non-threaded case. If you really want it, I suppose it's possible to skip the progress bar in this case, so it would only be writing and showing the special pause and unpause, which should be the fastest possible for those platforms.
Downloads in the non-threaded case I'm not sure about. Probably just keep them as is, or don't allow 'resuming' or starting a core while there is a download active. It's easy to figure out that you shouldn't be downloading something on a limited platform and playing at the same time - unlike savestates, where you usually are in-game to use them.
This was also something I saw in Emscripten builds (https://github.com/libretro/RetroArch/pull/15845/files); I wish I had found this bug report then. Web is kind of a special case because the files are all in memory (so a 16 MB chunk size is fine), but syscalls have so much overhead that it's worth considering larger chunk sizes on more platforms. Especially for loading a state, pausing emulation and prioritizing the load feels right.
Would it make sense to change the chunk size to a megabyte or so? Even on eMMC it should be possible to read 30 MB or more per second, which works out to roughly 500 KB per 60 fps frame, so within one frame you could read 512 KB or 1 MB without hurting anything. 4 KB is just really tiny. I might be missing something because I don't know about all the supported platforms.
Depends on how much you want the slider to jump. The current value is low so it can animate smoothly (and slowly :( ) for both the smallest and largest savestates. It's possible to compute the chunk at runtime instead so it always advances by 1% (so 100 calls) on every emulated platform, just by calculating 1% of the savestate size as the chunk.
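As a sketch, using hypothetical names rather than the real task_save.c symbols, that would be something like the following, with the old small value kept as a floor so tiny savestates don't regress:

```c
#include <stddef.h>

#define MIN_SAVE_STATE_CHUNK 4096 /* keep today's small value as the floor */

/* Hypothetical replacement for a hardcoded chunk: advance by ~1% of the
 * state per handler call, so every platform finishes in about 100 calls
 * regardless of how big its savestate is. */
static size_t save_state_chunk(size_t state_size)
{
   size_t chunk = state_size / 100;

   if (chunk < MIN_SAVE_STATE_CHUNK)
      chunk = MIN_SAVE_STATE_CHUNK;
   return chunk;
}
```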
Personally I'm hoping someone masochistic enough codes an alternate path that disables the task division and just writes out the buffer all at once, with a non-animated 'saving save #...' message. I believe it would make Dolphin savestates muuuch faster (and others too).
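A minimal sketch of that alternate path, assuming the frontend can pause and resume emulation around it (the function names here are stand-ins, not real RetroArch APIs):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical non-chunked path: pause, write the whole serialized buffer in
 * one go, unpause. emulation_pause()/emulation_resume() are placeholders for
 * whatever the frontend would actually call; they are not RetroArch functions. */
static bool save_state_blocking(const char *path,
      const void *state, size_t state_size,
      void (*emulation_pause)(void), void (*emulation_resume)(void))
{
   bool ok   = false;
   FILE *out = NULL;

   emulation_pause();
   if ((out = fopen(path, "wb")))
   {
      ok = (fwrite(state, 1, state_size, out) == state_size);
      fclose(out);
   }
   emulation_resume();
   return ok;
}
```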
And there is the bug I linked here that appears (to me) to be the Android scheduler slowing down RA because of this way of synchronizing the animation with I/O.
This bug still bothers me, but I've since accepted that it exists because RetroArch doesn't actually have a separate thread for making savestates when threaded tasks are disabled, so instead of pausing emulation it does this.
I don't accept the rationale though, or its necessity, because of how absurdly badly segmenting I/O writes degrades performance. Even when I had that old computer (I have an even worse one now, lol), if I compared Dolphin upstream and Dolphin-core savestates WITH threaded tasks for the core... well, the core was slower. By seconds. Dolphin took about two seconds to save, and this was on a rotational hard disk. The Dolphin core was more like... 5.
Modern I/O is speculative. If you start writing an array to disk, every part of the system warms up in the expectation that you will indeed write the whole thing out; context switches are not going to happen, and the data will sit in various minimal buffers waiting to be beamed out to the SSD or HDD.
Unless you interrupt it. This is why Dolphin's oh-so-immersion-breaking pause is a better experience than RA's segmented write (well, besides the absurdity of the fixed value apparently chosen for the fps-preserving pause - a rationale I find suspect in itself, because with that warmup a CPU writing raw array data to disk is almost certainly going to pipeline that work in a way that lets it do something else). You could press the button, go right back to gameplay, and press the button again without fear that you were still writing out a savestate and your new one would happen much later than expected.