dolphin icon indicating copy to clipboard operation
dolphin copied to clipboard

Rollback Reality

Open Sage-King opened this issue 1 year ago • 10 comments

What is rollback?

Rollback is a type of networking code. Two players connect and only share user-input data with each other(no game/emulator state). Every input is timestamped. Each player simulates their own copy of the game with no blocking for the remote user’s input. When the local copy of the game/emulator receives data from the remote user that is before the current time, the game/emulator “rolls back” aka resets the local copy to when that input would have arrived. The local game/emulator simulates the game(without writing to the screen) back to the current frame with those new remote inputs. If a game receives no remote user input, it assumes that the remote user’s input is the same.

This results in minimal visual artifacts and allows each local copy of the game to play the same way it would if both players were playing on the same physical console. It feels great for the players if done correctly. :) There are lots of more in-depth resources online explaining rollback and I am by no means an expert. Please read those if you need more info.

Goal:

I'm targeting 300ms of network latency with a 2-frame static input delay for up to four people.

Plan to Achieve Goal:

Get bare minimum rollback working on my local machine using existing infrastructure Verify rolled-back frames are correct and set up infrastructure for continual verification of rollback correctness. Optimize until performance goals are met or we prove they cannot be met The previous three points are a high-level overview. The hard part comes with the implementation and that is not something I think I can plan for entirely beforehand. The following are some of my ideas to achieve those five bullet points. I will add any suggestions and test all of them to see if they have merit.

Bare Minimum Rollback:

Write logic to use current savestates (maybe the undo buffer?) to rollback Slow emulator down and/or lower framerate to make this doable with the currently crazy slow save states >.< Take off the emulation speed cap to try to catch up after loading from savestate Introduce latency and/or speed emulator back up and/or raise framerate until there is cascading lag (system can’t keep up) Verify Correctness:

Make a tool that saves the rolled-back state of the emulator at each emulator step (or maybe at time intervals?), runs the emulator without rollback to compare the state of the emulator to the recorded steps from the rolled-back frames, then alert and enumerate any deviations. Optimization:

Cut out Code on the Hot Path: (That isn’t necessary for correctness. Looking at you Super Mario Galaxy EFB copies >:( )

Cut out any user-input gathering Cut out any rendering Cut out any dynamic allocations Cut out any file read/writes Cut out any networking Create Low Latency Emulator State Read/Writes:

Custom allocator for everything in the rollback system Reduce the size of emulator save states as much as possible

  • Dirty page tracker
  • Dirty byte tracker
  • Save actual RAM Add JIT’d code to savestates (might already be done) Make save states a consistent size to allow for static buffers Group save state data into cache lines (to make sure we don’t load the same data twice if we can avoid it) Order save state data to allow for continual, aligned, sequential (or otherwise easily predictable) memory accesses. (kinda already done? Might need some revisions if I standardize the size of save states. Definitely will need to be aligned if that turns out to be bogging some performance.) Multi-threading? Make the emulator emulate faster:

Look at any of the //TODO optimizations that have been suggested in the code (if I can find them again ;-; ) Multi-threading? Verifying JIT’d code uses the most recent SIMD instructions (or otherwise fully utilizes the CPU), pipelines as well as possible, is stored contiguously (or otherwise predictably accessible for the pre fetcher), branches stored on cache lines if possible, and has as much unpredictable code removed as possible. Identify any slow algorithms and replace them Identify any slow data structures and replace them Remove as many dynamic allocations and file read/writes as possible. Remove as many system calls as possible Make sure threads aren’t being oversubscribed General:

Utilize as much of the CPU/GPU/System as possible(We can always tune it down if needed. Currently Dolphin uses an incredibly small amount of my CPU.) Potential Enhancements:

Share Gecko Codes (this might come by default when this is fully implemented) All of these suggestions will need to be tested to make sure they actually give performance benefits. None of these are set in stone and some of them might make the emulator slower.

I only have access to my Haswell Intel CPU so all measurements will be made on it. Fortunately, from my understanding, most of my optimizations will be portable and increase performance for any modern CPU. (pre-fetcher, pipelining, branch prediction, etc.) Regardless of the increase in performance for other CPUs, I will try to make sure my optimizations do not make the performance of Dolphin worse.

If you are willing to test my optimizations on your CPU (extra points if it’s a different ISA, manufacturer, etc.), please let me know on IRC. My name is Wadwamille on IRC.

Sage-King avatar Jul 18 '22 16:07 Sage-King

If I understand correctly, you're trying to allow Dolphin's netplay to use the rollback mechanism, am I correct?

I don't want to sound too pessimistic but AFAICT projects like project-slippy had to both customise the emulator and implement the client in the game itself. Which obviously Dolphin doesn't do (patching the game's engine to support the rollback protocol). Patching the game's engine is probably more efficient than rolling back the entire emulator state using a savestate.

I don't think savestates (as per Dolphin's savestates) are suitable for such a use case, though, I might be wrong. Savestate saves lots of pieces of information and is quite slow even with good SSD and RAM. Moreover, anything kernel/driver related (audio, network, some GPU/CPU features, etc.) can't support rollback easily. For instance, how to rollback a write to a socket/file/driver/... without side effects?

Have you performed some benchmark on Dolphin's savestates? If so, do you think your goal is currently achievable?

targeting 300ms of network latency with a 2-frame static input delay for up to four people.

sepalani avatar Jul 20 '22 09:07 sepalani

Yes, I have done benchmarks on Dolphin's savestates. Yes, I think it's doable. If it's not, I'm going to show why it's not doable, with data. If I can show with data that none of my alternatives to savestates are fast enough, I will consider this project finished. I understand this is an ambitious project, but fun projects are all ambitious imo.

Yes, savestates in their current implementation are probably not suitable for rollback. The current commit has a working version of rollback (for gamecube)with savestates and it's running at 6fps when the configs are on my SSD and 3-0fps on my HDD. (It's a particularly slow 2TB HDD that I think was meant for servers. It was cheap.) The ISO for both cases is on the super slow HDD. I'm going to try to optimize savestates to work with rollback. If that doesn't work, I will restructure Dolphin's memory usage to better support pre-fetching during savestate loads and stores. PointerWrap in ChunkFile.h is currently my archnemesis, but it's like veins, every section of Dolphin uses it. If that's not enough, I'll custom write rollback savestates. Custom rollback savestates may need to have some limitations on them. I have many more ideas for optimization, some of which I've laid out in the initial PR comment. I'll either find optimizations that work or collect enough data to reasonably show that rollback can't be done at all. Currently, Dolphin uses an incredibly small portion of my computer's hardware resources. I'd like to have it coded so it can use 100% of a computer's resources.

How do savestates handle read/writes to volatile systems currently?

I haven't seen any special edge cases looking at the code and savestates seem to work perfectly fine. If reading/writing to sockets, drivers, files, or whatever else may have a permanent effect causes issues, I'll do what all speculative execution does. Read/write and store those in a buffer on the stack or heap, then have the actual read/writes happen once those frames are confirmed. I call it deferred execution, but I'm not sure if there's a more technical name the CPU developers use. If deferred execution doesn't work, I'll find a workaround. Reading/writing to volatile systems may need to be disabled for rollback.

I am also still grappling with how to verify rollback is correct and handle edge cases. I need to do a little more research and progress a little farther into the testing of rollback to get a better feel for how I ultimately want to handle reads/writes to sockets, disk, etc. That's my next step in the plan laid out in the initial PR comment. I may try to optimize a little bit so I'm 10+ fps before I do that. It's a little hard to verify that rollback is working if the entire emulator is running at 0.5fps. I still have lots of research to do before I'm close to answering your questions with data and implementations. I would be happy to let you know in IRC or on here if you want to see how I handle the cases you brought up when I find a solution.

Sage-King avatar Jul 20 '22 17:07 Sage-King

Yes, savestates in their current implementation are probably not suitable for rollback. The current commit has a working version of rollback (for gamecube)with savestates and it's running at 6fps when the configs are on my SSD and 3-0fps on my HDD.

This is just an idea and I'm not an expert on Dolphin's code, but if the disk I/O is that much of a bottleneck, then could the rollback logic be modified to store the temporary savestates in memory instead of on the disk?

bdr99 avatar Jul 20 '22 17:07 bdr99

They do store on the heap, but it's also like 400mb/s with how many savestates I need to make lol. I have quite a few thoughts on why it's so slow, but I need to test more to see if those are correct.

Sage-King avatar Jul 20 '22 17:07 Sage-King

They do store on the heap, but it's also like 400mb/s with how many savestates I need to make lol. I have quite a few thoughts on why it's so slow, but I need to test more to see if those are correct.

OK, got it. Thanks for clarifying!

Looking forward to the results of your analysis on whether or not this feature is feasible.

bdr99 avatar Jul 20 '22 18:07 bdr99

but if the disk I/O is that much of a bottleneck, then could the rollback logic be modified to store the temporary savestates in memory instead of on the disk?

Even just saving to memory will still be too slow; Good DDR4 ram is limited to ~30GB per second, which gives you 500mb of bandwidth per frame at 60hz. Simply saving out all of the Wii's 96MB of memory has already eaten up 40% of your per-frame memory bandwidth, and you still need time to run emulation and restore some of those states.

But with many optimisations it might be possible.


As for your Sage-King's optimisation ideas:

Dirty page tracker

Great idea. Will help with the limited memory bandwidth issues.

One related idea I've always wanted is a way to track memory that is unmodified since being DMAed from disk. That way we won't even need to save them to an (on-disk) savestate,

Dirty byte tracker

I somewhat doubt going down to byte granularity will be worth it; even cacheline granularity seems like overkill.

If you are doing a sub-page granularity, consider the 512 byte alignment used by various bits of flipper hardware (like efb copies and texture base address, potentially other things?)

Add JIT’d code to savestates (might already be done)

I don't think this is wise to put the actual jitted code in savestates (especially not to disk). Instead just make the code cache able to keep invalidated code around and then un-invalidate it and keep a list of changes since the last save state.

Same goes for other caches, like the texture cache. You might want to introduce the concept of "savestates that are likely to be restored" so parts of dolphin code can track their state relative to them and allow faster restores.

Multi-threading?

Is there any point in multithreadding the actual memcpy? My gut says a typical consumer x86 cpu can get close enough to optimal throughput for memcpys with just a single core (though worth checking?)

But might be worth multithreading different sections of the savestate code, so some of the more computationally expensive stuff can run in parallel.

Make the emulator emulate faster

Well, yes. Basically your only fallback option if all else fails.

Though I suspect you are cpu thread bound, and from memory that's mostly JIT bound. You might end up needing to do massive JIT re-writes if you get that desperate.

I think there is fertile ground in trying to make JIT compile whole functions, and making sure large functions don't get partially inlined to their callers unless it's worth it. Should cut down on icache usage. Also opens up the opportunity to do whole-function optimisations.

Which probably requires moving to the concept of a 2nd-level JIT to optimise hot parts of code.


Some more optimisation ideas of my own, that you didn't appear to mention:

  • Can we add snapshot support to fastmem? Would allow us to resume emulation before the savestate even completes;
    • Another thread can continue copying out memory to complete the savestate.
    • Even better, the snapshot can be part of the savestate.
    • If that works, snapshotting could be extended to other components.
  • HLE replacement of threading within games.
    • Reduces the amount of code to be run though the jit
    • gives us better insight and even control over the game's threads
    • Better detection of when the game is idle
  • Move the games input polling to right before rendering starts; Eliminate any concept of "multiple frames in flight" within the game.
    • Might need to be done with patches, though maybe it's possible via HLE (the threading above)
    • This way the GPU is guaranteed to be idle, and it simplifies the amount of state to savestate on the GPU thread.
    • Additionally, make dolphin reset the gpu state to a consistent state at the beginning of each frame. Even less state to savestate.
  • On the topic of JIT optimisations: Make the jit optimise out write/gather pipe writes and inline the Opcode Decoder.
    • Opcode Decoder is about the only part of gpu emulation that really needs needs to be done on the cpu thread.
    • I actually have some minimal work towards this on a really old branch.
    • Should combine really well with whole-function optimisation and constant propagation.

phire avatar Jul 30 '22 00:07 phire

Is there any point in multithreading memcpy?

Yes, but it probably has minimal benefit (and maybe too much overhead for any benefit) and likely makes the changes less portable since I'd need to use intrinsics. I did read a paper about someone who did beat memcpy though.

Dolphin has ~20 threads running at any given time. I'd rather not add to that if I can either.

Mulithreading different sections of savestate

Eh? Maybe. Sounds like a gigantic pain, and might not give any performance boost. There are some things that need to be loaded in a certain order as well. On top of that, I'd essentially need to completely scrap the current savestates in favor of a really large struct since savestates are byte-by-byte serialized together, completely unaligned. I may end up doing this anyway because I think it'd help Dolphin's performance overall. (more things closer together = fewer cache misses) However, I started it, and it was a pain. A HUGE pain. save states and the weird way they're generated are in basically every section of the code. Many base class pointers are saved and I can't save those without including some kind of child class information as well. (at least I think I can't for correctness). I will be avoiding this restructuring of Dolphin at basically all costs if I can.

Save JIT'd code that's been invalidated as well as other caches that have been invalidated

That is an excellent idea that I would not have thought of. I will keep that in mind when I start optimizing.

. . . thought I suspect you're CPU thread bound.

From my profiling, it seems like Dolphin is primarily memory-bound. It only uses about 30% of my CPU when running (I do not have a threadripper or xeon; this is four cores we're talking about). VTune says icache is primarily responsible for the speed with a little bit of dcache.

Hopefully I won't have to rewrite large sections of the JIT ;-; I will do that if necessary tho.

512 granularity for dirty pages

Eh, I'm not sure I even want to go below L1 cache honestly. The more I learn the sillier it seems to dedicate so much memory and processor time to reducing save state size. It might be worth it, but it's on the bottom of my list to try. One thing I may try to do is get savestates mostly or entirely into my L3/L2 cache (8mb, 4mb) and have different cores work on sections that are L1 sized. I definitely need to reduce save state size as much as possible since by your calculations and my experience, save states are not as fast as they need to be to implement rollback right now.

I also wanted to clear up that I only save to memory. I'm using State::SaveToBuffer(); and State::LoadFromBuffer();. I (hopefully) am not writing to disk at all.

Can we add snapshot support to fastmem?

Since I don't know what fastmem is or how it's used, yes, but it will take time. I'll also need to see if we actually need the performance.

HLE replacement of threading within games.

I'm a little confused. I thought HLE was a DSP thing. It sounds like you're talking about managing in-game threads? If so, that sounds cool and I'm interested, but I'd need to learn a lot more before I tried that. I also thought Broadway/Gekko were single-core? I'd appreciate more elaboration and explanation, no biggie if you don't tho.

Move the games input polling to right before rendering starts; Eliminate any concept of "multiple frames in flight" within the game.

I have done something like this for rollback mode already. I only poll and send inputs on the end of the frame. It was to help with correctness, but it could probably be extended to more parts of the emulator if needed. I will definitely keep that in mind

Additionally, make dolphin reset the gpu state to a consistent state at the beginning of each frame. Even less state to savestate.

Yes, this sounds amazing :o Pokechu also pointed out in another PR that there is a 4mb table getting saved that probably doesn't really help anything at all. If I could ditch that table and the GPU save state, that would be awesome. I didn't know this was an option, but I will definitely be looking into it as one of my first optimizations.

On the topic of JIT optimisations: Make the jit optimise out write/gather pipe writes and inline the Opcode Decoder. . . . Should combine really well with whole-function optimisation and constant propagation.

I know what all of those words mean and I do not have the context to put them together. It sounds like a solid optimization from some light research though. I'm not super familiar with any compiler optimizations yet, but I will brush up and implement this if the JIT'd code turns out to be the bottleneck.

Sage-King avatar Jul 30 '22 17:07 Sage-King

I did read a paper about someone who did beat memcpy though.

Such papers are always written on big server platforms, with way more memory channels and are often NUMA. I'm not sure how well they map down to consumer CPUs with dual-channel memory and a ringbus.

Dolphin has ~20 threads running at any given time. I'd rather not add to that if I can either.

Eh, the cost of creating threads isn't that high; It's switching/syncing that causes issues. Many of those threads are just sitting there not even scheduled, waiting for events.

But I was thinking you could use the gpu thread, since it already exists.

Mulithreading different sections of savestate

Eh? Maybe. Sounds like a gigantic pain, and might not give any performance boost

Correct. Only look into this if you manage to refactor the state code enough so that the size of regions can be predicted upfront.

VTune says icache is primarily responsible for the speed Hopefully I won't have to rewrite large sections of the JIT

Yeah, my understanding is that JIT just produces too much code and often doesn't fit in icache.

Hopefully you shouldn't need to much (if any) work JIT. The scenario were it makes sense to do major rewrites is a bit narrow, you would need to get savestate time short enough that it can run realtime without the emulated cpu doing any work, but not quite fast enough to also run our current jit.

I'd actually put it below the 512 byte pages on your list of things to try.

The more I learn the sillier it seems to dedicate so much memory and processor time to reducing save state size.

Certainly memory; it destroys your caches. But often spending extra cpu time calculating things is pretty cheap, and can execute in parallel with stuff that's already memory bound.

I've seen papers pointing out that compressing data can be faster than a memcpy at times.

Can we add snapshot support to fastmem?

Since I don't know what fastmem is or how it's used, yes

Fastmem is dolphin's approach to abusing pagetables to speed up memory address calculation. We allocated 4 GB of virtual address space, and leave it mostly un-mapped, except for the addresses corresponding to actual memory in gamecube/wii memory maps. the The JIT can just add the base of that region to any addresses and do the load/store. It will get a pagefault if there isn't emulated memory at that address.

My thought was that it might be viable to update the pagetables in this mapping at a page granularity on every savestate instead of copying the actual data. Though now I think about it, handling pagefaults and doing copy-on-write would be slower than just doing a memcpy upfront. Pages that are modified one frame are likely to be modified again.

HLE replacement of threading within games.

I'm a little confused. I thought HLE was a DSP thing.

Dolphin doesn't do much HLE. We HLE the DSP and we to a tiny bit of hooking functions on the cpu. But it might be viable to push it further.

Gecko/Broadway might be single core, but the SDK does provide a threading library for games to use; Mostly so other work can be done while waiting for the gpu and other IO.

phire avatar Jul 30 '22 19:07 phire

I did not mean to close this. I didn't know GitHub stopped your PR if you reset your branch to master.

Sage-King avatar Jul 31 '22 18:07 Sage-King

One of the problems with save states on Dolphin is the compression method. The compression called LZ0 is used for compressing save states, meaning that loading the states is pretty slow because of the way it decompresses, which is not good for rollback. You could implement a different compression method like zstandard, which is very fast at decompressing

Adamillo avatar Aug 01 '22 07:08 Adamillo

A better way to reduce state size in the first place is just a dirty page tracker as said earlier in this PR. That itself would score massive wins for state size (as that itself would do most of the work compression would normally do). Of course this means compression is much less effective, but on the flipside compression could potentially be set by a user to be not done due to the "compression" already done by a dirty page tracker.

CasualPokePlayer avatar Aug 11 '22 02:08 CasualPokePlayer

@Sage-King If you don't mind me asking, what happened with this PR? I see that it was merged, but it looks like no lines were changed?

bdr99 avatar Aug 11 '22 02:08 bdr99

@bdr99 I accidentally pushed the same commit as Master. GitHub bug said it merged, so I'll be opening a new PR in the future. AFAIK the merge can't be undone and this PR can't be revived.

Sage-King avatar Aug 11 '22 02:08 Sage-King

@CasualPokePlayer Yeah, I intend to try a page tracker since it doesn't seem too hard. Right now I'm getting actual rollback working which is kicking my butt. Navigating three threads for my first multi-threaded project is a lot. ;-;

Sage-King avatar Aug 11 '22 02:08 Sage-King