solid_queue

Fix a couple of race conditions on shutdown

Open rosa opened this issue 1 year ago • 0 comments

This fixes #108 and two other issues I found while working on that one:

  • Only try to unblock the next blocked job if the job actually completed, whether successfully or with a failure. Otherwise, the job might still be claimed when we signal the semaphore, and the semaphore would be lying about how many jobs are in progress. A good example of where this can happen is when the worker is sent a QUIT signal to exit right away and the thread pool is killed. The worker or the supervisor then tries to release claimed executions after the shutdown, and the claimed execution that holds the semaphore can end up blocked, because the semaphore is held at least by that job itself. Then, depending on the order in which the thread pool shuts down and the worker is deregistered, we could end up with a job trying to unblock itself and a stuck semaphore.
  • ~~Similarly to the above, don't go through the general dispatch flow when releasing claimed executions. That is, don't try to gain the concurrency lock, because claimed executions with concurrency limits that are released would most likely be holding the semaphore themselves, as it's only released after the job completes. This means these claimed executions would go to the blocked state upon release, leaving the semaphore busy. Just assume that if a job has a claimed execution, it's because it already gained the lock when going to ready.~~ Shipped this one separately in #121, with proper tests.
  • Use `exit` instead of `exit!` on immediate termination in runnable processes so that `at_exit` hooks are run if needed. Also, remove the logging for failures to deregister a process, as it just adds noise and we were re-raising the exception anyway.
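
To illustrate the first item, here is a minimal, self-contained sketch of "signal the semaphore only if the job completed". It is a toy in-memory model, not Solid Queue's code: `ToySemaphore`, `ToyExecution`, and `ToyKill` are hypothetical names made up for this example, and the killed thread is simulated with an exception that does not inherit from `StandardError`.

```ruby
# Toy in-memory model. Every name here is hypothetical, not Solid Queue's API.
class ToySemaphore
  attr_reader :available

  def initialize(limit)
    @available = limit
  end

  # Take one slot if any is free; returns true when the slot was taken.
  def wait
    return false if @available.zero?

    @available -= 1
    true
  end

  # Give one slot back.
  def signal
    @available += 1
  end
end

# Stands in for a thread being killed mid-job (not a StandardError on purpose).
class ToyKill < Exception; end

class ToyExecution
  attr_reader :state

  def initialize(semaphore, &work)
    @semaphore = semaphore
    @work = work
    @state = :claimed
  end

  def perform
    @work.call
    @state = :finished
  rescue StandardError
    @state = :failed
  ensure
    # Signal only when the job really ended. An interrupted job stays
    # :claimed and keeps its slot until it is released and run again.
    @semaphore.signal if completed?
  end

  def completed?
    %i[finished failed].include?(@state)
  end
end

semaphore = ToySemaphore.new(1)
semaphore.wait # the job gains the lock before it is claimed

interrupted = ToyExecution.new(semaphore) { raise ToyKill }
begin
  interrupted.perform
rescue ToyKill
  # The worker is shutting down; the execution is still claimed.
end

puts interrupted.state   # still :claimed
puts semaphore.available # the slot has not been handed back
```

With an unconditional `signal` in the `ensure` clause, the interrupted job would hand its slot back while still being claimed, which is the miscount described above.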
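
The difference behind the last item can be checked in isolation. In Ruby, `Kernel#exit` runs registered `at_exit` handlers before the process ends, while `Kernel#exit!` skips them. The sketch below runs two tiny child scripts and compares what they print; the `child_output` helper and the `HOOK` snippet are made up for this example and are not part of Solid Queue.

```ruby
require "open3"
require "rbconfig"

# Runs a short Ruby script in a child process and returns what it printed.
def child_output(script)
  output, _status = Open3.capture2(RbConfig.ruby, "-e", script)
  output
end

# A stand-in for any cleanup a process registers to run on exit.
HOOK = 'at_exit { puts "cleanup ran" }'

with_exit      = child_output("#{HOOK}; exit")  # at_exit handlers run
with_exit_bang = child_output("#{HOOK}; exit!") # at_exit handlers are skipped

puts "exit:  #{with_exit.inspect}"
puts "exit!: #{with_exit_bang.inspect}"
```

Only the child that calls `exit` prints the cleanup message, which is why switching to `exit` lets hooks registered by the application still run on immediate termination.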

rosa, Jan 10 '24 20:01