touchHLE
Semi-co-operative threading model is prone to deadlocks when host-to-guest calls stack
touchHLE has a hybrid of co-operative and pre-emptive multitasking. There is full pre-emption for guest code (we just swap out the register state for threads like any OS would), but because everything runs on a single host thread and we don't use coroutines, host code is co-operative: other host or guest code can't execute until host code is done.
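To make the "swap out the register state" part concrete, here is a minimal sketch of guest-side pre-emption. The types `Registers`, `GuestThread` and `Emulator`, and the function `switch_to`, are hypothetical stand-ins invented for illustration, not touchHLE's actual definitions.

```rust
/// Hypothetical guest CPU state (illustration only, not touchHLE's real type).
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Registers {
    regs: [u32; 16],
}

/// Hypothetical per-thread record holding the registers of a thread that is not running.
struct GuestThread {
    saved: Registers,
}

struct Emulator {
    cpu: Registers,
    current: usize,
    threads: Vec<GuestThread>,
}

impl Emulator {
    /// Pre-empt the current guest thread: save the live registers,
    /// then load the saved registers of the thread we switch to.
    fn switch_to(&mut self, next: usize) {
        self.threads[self.current].saved = self.cpu;
        self.cpu = self.threads[next].saved;
        self.current = next;
    }
}

/// Runs two toy guest threads that each write a different value to r0,
/// then returns what each thread sees in r0 after being switched back in.
fn switch_demo() -> (u32, u32) {
    let mut emu = Emulator {
        cpu: Registers::default(),
        current: 0,
        threads: vec![
            GuestThread { saved: Registers::default() },
            GuestThread { saved: Registers::default() },
        ],
    };
    emu.cpu.regs[0] = 1; // "thread 0" writes r0
    emu.switch_to(1);
    emu.cpu.regs[0] = 2; // "thread 1" writes r0
    emu.switch_to(0);
    let r0_on_thread_0 = emu.cpu.regs[0];
    emu.switch_to(1);
    (r0_on_thread_0, emu.cpu.regs[0])
}
```

Guest state is cheap to swap like this at any point. The host call stack is not part of that state, which is why host code stays co-operative.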
A simple case where this can be a problem is the current implementation of `sleep()`. If we have two threads which are both sleeping, the host call stack can become this mess:
- thread 1 calls `sleep()`
  - `Environment::run_call()` -> `Environment::run_inner()`
    - switches to thread 2
      - thread 2 calls `sleep()`
        - `Environment::run_call()` -> `Environment::run_inner()`
          - spins until thread 2 wakes up. If thread 1 wakes up earlier, we are still stuck here!
          - thread 2 wakes up and signals return-to-host
        - `sleep()` returns
    - waits until thread 1 wakes up
    - thread 1 wakes up and signals return-to-host
  - `sleep()` returns
Sleeping for longer than necessary is relatively benign, but there can be far worse consequences. Potentially we can end up with deadlocks.
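The oversleeping can be reproduced with a toy model that uses a virtual clock. This is only a sketch: `Sim`, its `sleep` method and the `pending` queue are made-up names, and the nested call stands in for what `Environment::run_inner()` does when it switches threads.

```rust
/// Toy model of the nested host call stack (all names are hypothetical).
struct Sim {
    /// Virtual clock, in ticks.
    now: u64,
    /// (thread id, tick at which that thread's sleep() returned)
    log: Vec<(usize, u64)>,
}

impl Sim {
    /// Host implementation of sleep() that blocks the single host stack.
    /// `pending` lists threads that will call sleep() as soon as they are
    /// scheduled, as (thread id, duration).
    fn sleep(&mut self, thread: usize, duration: u64, pending: &mut Vec<(usize, u64)>) {
        let wake_at = self.now + duration;
        // Stand-in for run_inner(): we switch to another thread, which
        // immediately makes its own nested host call on top of ours.
        if let Some((other, other_duration)) = pending.pop() {
            self.sleep(other, other_duration, pending);
        }
        // Only after the nested call has returned can we notice our own wake-up.
        if self.now < wake_at {
            self.now = wake_at;
        }
        self.log.push((thread, self.now));
    }
}

/// Thread 1 asks for 10 ticks, then thread 2 asks for 100 ticks.
fn nested_sleep_demo() -> Vec<(usize, u64)> {
    let mut sim = Sim { now: 0, log: Vec::new() };
    let mut pending = vec![(2, 100)];
    sim.sleep(1, 10, &mut pending);
    sim.log
}
```

In this model, thread 1 asked for 10 ticks but its `sleep()` only returns at tick 100, after thread 2's nested call has finished.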
In any case where
- guest code on thread A calls a host function,
- that function calls another guest function on thread A,
- while in that call, `Environment::run_inner()` switches threads to thread B,
- guest code on thread B calls a host function on thread B, and
- that function calls another guest function on thread B
then we will end up in a deadlock if the inner guest function on thread B is waiting on something in the outer guest function on thread A. This is because thread A's host function call can't return while we're busy with thread B's host function call.
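The condition above can be expressed as a small check over a model of the host stack. This is a sketch under simplifying assumptions: `Frame`, `Outcome` and `run` are hypothetical names, and "waiting on something in the outer guest function" is reduced to "cannot finish until that thread's host call has returned".

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,
    Deadlocked,
}

/// One pending host-to-guest call on the single host stack (hypothetical model).
struct Frame {
    thread: char,
    /// If set, the guest code in this frame cannot finish until the host
    /// call belonging to that thread has returned.
    waits_for_return_of: Option<char>,
}

/// Unwinds the stack from the top. Because the host stack is LIFO, only the
/// top frame can return, so a top frame that waits on a frame below it can
/// never make progress.
fn run(mut stack: Vec<Frame>) -> Outcome {
    while let Some(top) = stack.last() {
        let blocker = top.waits_for_return_of;
        let below = &stack[..stack.len() - 1];
        let blocked = match blocker {
            Some(t) => below.iter().any(|f| f.thread == t),
            None => false,
        };
        if blocked {
            return Outcome::Deadlocked;
        }
        stack.pop();
    }
    Outcome::Completed
}

/// Thread A's host call is at the bottom; thread B's nested host call sits on
/// top of it and waits for something that only happens once A's call returns.
fn deadlock_demo() -> Outcome {
    run(vec![
        Frame { thread: 'A', waits_for_return_of: None },
        Frame { thread: 'B', waits_for_return_of: Some('A') },
    ])
}

/// Same stack, but thread B does not depend on thread A.
fn no_deadlock_demo() -> Outcome {
    run(vec![
        Frame { thread: 'A', waits_for_return_of: None },
        Frame { thread: 'B', waits_for_return_of: None },
    ])
}
```

The model ignores everything except stack order, which is the point: the deadlock follows from the LIFO host stack alone, regardless of what the guest code is actually waiting for.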
These deadlocks could happen randomly! There's no guarantee of when thread switching will or won't happen. And of course, we have this guest-to-host-to-guest call structure very frequently, for example because of `objc_msgSend`.
This isn't a theoretical issue: @ciciplusplus has already hit problems related to this when trying to get DOOM working.
This is one of touchHLE's most fundamental design limitations, and it's not going to be easy to fix. There are only two solutions I can see:
- The complete solution: switch to using real threads, i.e. a 1-to-1 relationship between guest threads and host threads. This solves all deadlock problems and is conceptually cleanest, but it would be very annoying considering Rust's thread safety rules. Right now every host function gets a `&mut` on the entire emulator state (`&mut Environment`), or on any smaller part of it, including the guest memory, Objective-C host objects, etc. All of this seems fundamentally incompatible with threading…
- A partial solution: try to avoid guest-to-host-to-guest call stack patterns as much as possible. I can rework `sleep()` again, change `objc_msgSend` to not do a host-to-guest call except where necessary, and perhaps move to using some kind of continuation-like thing (async functions?) to make host functions more co-operative. There's some low-hanging fruit here that can get us a long way, but doing it comprehensively is something else. We'd inevitably have a long tail of rare, difficult-to-fix bugs.
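As an illustration of the partial approach applied to `sleep()`, here is a sketch in which the host function only records a deadline and returns, leaving the waiting to one flat scheduler loop. `Scheduler`, `Thread` and `run_until_all_awake` are hypothetical names on a virtual clock; this is not how touchHLE is implemented and not a concrete proposal for its API.

```rust
/// Hypothetical guest thread record: `wake_at` is set while the thread sleeps.
struct Thread {
    id: usize,
    wake_at: Option<u64>,
}

struct Scheduler {
    /// Virtual clock, in ticks.
    now: u64,
    threads: Vec<Thread>,
    /// (thread id, tick at which it was woken)
    log: Vec<(usize, u64)>,
}

impl Scheduler {
    /// Non-nesting sleep(): record the deadline and return to the scheduler
    /// right away, instead of running a nested loop on the host stack.
    fn sleep(&mut self, id: usize, duration: u64) {
        let wake_at = self.now + duration;
        if let Some(t) = self.threads.iter_mut().find(|t| t.id == id) {
            t.wake_at = Some(wake_at);
        }
    }

    /// Single flat loop: advance the clock to the earliest deadline and wake
    /// every thread that is due. No host call is left pending on the stack.
    fn run_until_all_awake(&mut self) {
        loop {
            let next = match self.threads.iter().filter_map(|t| t.wake_at).min() {
                Some(n) => n,
                None => break,
            };
            self.now = self.now.max(next);
            let now = self.now;
            for t in self.threads.iter_mut() {
                if let Some(w) = t.wake_at {
                    if w <= now {
                        t.wake_at = None;
                        self.log.push((t.id, now));
                    }
                }
            }
        }
    }
}

/// Thread 1 sleeps for 10 ticks and thread 2 for 100 ticks, both from tick 0.
fn flat_sleep_demo() -> Vec<(usize, u64)> {
    let mut s = Scheduler {
        now: 0,
        threads: vec![
            Thread { id: 1, wake_at: None },
            Thread { id: 2, wake_at: None },
        ],
        log: Vec::new(),
    };
    s.sleep(1, 10);
    s.sleep(2, 100);
    s.run_until_all_awake();
    s.log
}
```

Here each thread wakes at its own deadline because no host frame has to wait for another one to unwind. The hard part, which this sketch does not address, is doing the same for host functions that need a result back from guest code.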