riot
Process-stealing deadlock
When running on a large number of cores, the current process-stealing implementation starts deadlocking schedulers and exposes a few other bugs:
- A process gets queued up in several schedulers, which is likely a bug in the Proc_queue or Proc_set. Once it's terminated in one scheduler, the next scheduler that tries to run it fails, because finalized processes should never be put on a queue.
- When moving timers around, a timer will sometimes get triggered on a scheduler before it's moved out of it. Moving timers to the IO scheduler helps, and can improve the reliability of the timers since the polling workload has a strict deadline, but it also means reworking the timeouts for receives and syscalls.
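To make the first invariant concrete, here is a minimal sketch of one way to guarantee that a process is claimed by at most one scheduler at a time and is never re-queued once finalized. This is not Riot's actual code: the `state` type, the `proc` record, and the `try_claim`, `release`, and `finalize` functions are all hypothetical names chosen for illustration.

```ocaml
(* Hypothetical sketch, not Riot's implementation. Requires OCaml 5. *)

(* Lifecycle of a process from the point of view of the run queues. *)
type state =
  | Idle       (* not on any queue, may be claimed *)
  | Queued     (* owned by exactly one scheduler *)
  | Finalized  (* terminated, must never be queued again *)

type proc = { pid : int; state : state Atomic.t }

let make_proc pid = { pid; state = Atomic.make Idle }

(* A scheduler may push the process onto its own queue only if it wins the
   Idle -> Queued transition. Any other scheduler racing on the same
   process gets [false], and so does anyone touching a finalized one. *)
let try_claim proc = Atomic.compare_and_set proc.state Idle Queued

(* Called after the owning scheduler has run the process and wants to make
   it available again. Leaves a Finalized process untouched. *)
let release proc = ignore (Atomic.compare_and_set proc.state Queued Idle)

(* Called when the process terminates. *)
let finalize proc = Atomic.set proc.state Finalized
```

With a guard like this, a stealing scheduler would call `try_claim` before moving a process and simply skip it on `false`, which turns both the double-queue case and the finalized-process case into a no-op rather than a failure.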
I've been unable to fix this with additional safeguards (like more restrictive locking of the process queue), but I have identified that the Proc_set is not working as intended (likely due to the use of Atomics instead of a lock).
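For comparison, a lock-based set could look roughly like the sketch below. It is an assumption about what "a lock instead of Atomics" might mean here, not the real Proc_set: the module name `Locked_pid_set`, its interface, and the use of plain `int` pids are hypothetical.

```ocaml
(* Hypothetical sketch, not Riot's Proc_set. Requires OCaml 5, where
   Mutex and Domain are part of the standard library. *)
module Locked_pid_set : sig
  type t

  val create : unit -> t
  val add : t -> int -> bool     (* true only if the pid was not present *)
  val remove : t -> int -> bool  (* true only if the pid was present *)
  val mem : t -> int -> bool
  val size : t -> int
end = struct
  type t = { lock : Mutex.t; members : (int, unit) Hashtbl.t }

  let create () = { lock = Mutex.create (); members = Hashtbl.create 64 }

  (* Run [f] while holding the lock, releasing it even if [f] raises. *)
  let with_lock t f =
    Mutex.lock t.lock;
    Fun.protect ~finally:(fun () -> Mutex.unlock t.lock) f

  (* The membership check and the insertion happen under the same lock,
     so two domains can never both observe "absent" and both insert. *)
  let add t pid =
    with_lock t (fun () ->
        if Hashtbl.mem t.members pid then false
        else (
          Hashtbl.replace t.members pid ();
          true))

  let remove t pid =
    with_lock t (fun () ->
        if Hashtbl.mem t.members pid then (
          Hashtbl.remove t.members pid;
          true)
        else false)

  let mem t pid = with_lock t (fun () -> Hashtbl.mem t.members pid)
  let size t = with_lock t (fun () -> Hashtbl.length t.members)
end
```

The point of the sketch is that check-then-insert is a single critical section. A set built from separate atomic reads and writes can lose that property, which would be consistent with a process ending up in more than one scheduler.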
In the meantime, `main` has disabled process-stealing until we figure out next steps here.
This is a good time to step back and maybe rewrite the scheduler into more modular pieces that are easier to reason about and test.