riot

Process-stealing deadlock

Open • leostera opened this issue 1 year ago • 0 comments

When running on a large number of cores, the current process-stealing implementation deadlocks schedulers and exposes a few other bugs:

  • a process gets queued up in several schedulers, which is likely a bug in the Proc_queue or Proc_set; once it is terminated in one scheduler, the next scheduler that tries to run it will fail, because finalized processes should never be put on a queue.

  • when moving timers around, a timer will sometimes get triggered on a scheduler before it is moved out of it. Moving timers to the IO scheduler helps, and can improve the reliability of the timers since the polling workload has a strict deadline, but it also means reworking the timeouts for receives and syscalls.
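The invariant behind the first bullet (a process sits in at most one queue, and a finalized process in none) can be sketched with a compare-and-set on a per-process state. This is only an illustration under assumptions: the `state` and `proc` types and the `try_enqueue`, `dequeue`, and `finalize` functions are hypothetical and are not Riot's actual `Proc_queue` API, and the plain `Queue.t` stands in for a per-scheduler run queue that would need its own synchronization.

```ocaml
(* Hypothetical sketch: these types are illustrative, not Riot's. *)
type state = Runnable | Queued | Running | Finalized

type proc = { pid : int; state : state Atomic.t }

let make_proc pid = { pid; state = Atomic.make Runnable }

(* Only one caller can win the Runnable -> Queued transition, so the
   process ends up in at most one queue. A Finalized process never
   matches Runnable, so it is never enqueued. *)
let try_enqueue (queue : proc Queue.t) (proc : proc) : bool =
  if Atomic.compare_and_set proc.state Runnable Queued then begin
    Queue.push proc queue;
    true
  end
  else false

(* Pop the next process that is still Queued and mark it Running.
   Entries whose state changed in the meantime (for example, the
   process was finalized elsewhere) are stale and are skipped. *)
let rec dequeue (queue : proc Queue.t) : proc option =
  match Queue.take_opt queue with
  | None -> None
  | Some proc ->
      if Atomic.compare_and_set proc.state Queued Running then Some proc
      else dequeue queue

(* Once finalized, the process can no longer be claimed by any queue. *)
let finalize (proc : proc) : unit = Atomic.set proc.state Finalized
```

With this shape, a second scheduler that tries to steal or enqueue an already-queued process gets `false` back instead of a duplicate entry, and a stale entry for a finalized process is dropped at dequeue time rather than run.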

I've been unable to fix this with additional safeguards (like more restrictive locking of the process queue), but I have identified that the Proc_set is not working as intended (likely due to the use of Atomics instead of a lock).
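For comparison, a lock-based set makes the membership check and the insertion a single critical section, which is the property a check-then-insert built from separate atomic operations can lose. The sketch below is an assumption about what such a replacement could look like, not a description of how `Proc_set` is implemented: the `locked_set` type and its functions are hypothetical, pids are modeled as plain `int`s, and it assumes OCaml 5, where `Mutex` is part of the standard library.

```ocaml
(* Hypothetical stand-in for a process set; not Riot's Proc_set. *)
type locked_set = { lock : Mutex.t; members : (int, unit) Hashtbl.t }

let create () : locked_set =
  { lock = Mutex.create (); members = Hashtbl.create 64 }

(* Run [f] while holding the lock, releasing it even if [f] raises. *)
let with_lock (set : locked_set) (f : unit -> 'a) : 'a =
  Mutex.lock set.lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock set.lock) f

(* The membership check and the insert happen under the same lock, so
   exactly one caller gets [true] for a given pid until it is removed. *)
let add (set : locked_set) (pid : int) : bool =
  with_lock set (fun () ->
      if Hashtbl.mem set.members pid then false
      else begin
        Hashtbl.replace set.members pid ();
        true
      end)

let remove (set : locked_set) (pid : int) : unit =
  with_lock set (fun () -> Hashtbl.remove set.members pid)

let mem (set : locked_set) (pid : int) : bool =
  with_lock set (fun () -> Hashtbl.mem set.members pid)
```

A scheduler would only push a process onto its queue when `add` returns `true`, and call `remove` when the process is dequeued or finalized. The lock adds contention compared with atomics, which is the trade-off to measure before adopting it.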

In the meantime, process stealing has been disabled on main until we figure out next steps here.

This is a good time to step back and maybe rewrite the scheduler into more modular pieces that are easier to reason about and test.

leostera, Jan 30 '24 17:01