[BUG] Deadlock cycle from thread pool implementation
Describe the bug As initially reported in adlai#5, SBCL occasionally reports deadlock caused by the thread pool.
To Reproduce Steps to reproduce the behavior:
- Run multiple parallel threads that constantly produce new tasks, similarly to scalpl's actors.
- Wait for RNGesus to schedule the deadlock cycle.
- Receive
Condition of type SB-THREAD:THREAD-DEADLOCK - Report this fact, and possibly additional findings, in this Issue; or invoke the
ABORTrestart and enjoy the hardworking actors.lisp rescheduling of the tasks.
Expected behavior Ideally, either there should not be any deadlock; alternatively, tasks could be restartable automatically.
Screenshots or Error Logs
Deadlock cycle detected:
#1=#<SB-THREAD:THREAD #2="ChanL Thread Pool Worker" RUNNING {1001369D53}>
waited for:
#<SB-THREAD:MUTEX "thread pool lock" free owner=0>
owned by:
#<SB-THREAD:THREAD "ChanL Thread Pool Worker" waiting on: #3=#<MUTEX "thread leader lock" contested owner=#2#> {100163B403}>
waited for:
(#3#)
owned by:
#1#
[Condition of type SB-THREAD:THREAD-DEADLOCK]
Restarts:
0: [ABORT] abort thread (#<THREAD "ChanL Thread Pool Worker" RUNNING {1001369D53}>)
Backtrace:
0: (SB-THREAD::CHECK-DEADLOCK)
1: (SB-THREAD::%WAIT-FOR-MUTEX #<SB-THREAD:MUTEX "thread pool lock" free owner=0> NIL NIL NIL NIL NIL NIL)
2: ((FLET "CLEANUP-FUN-8" :IN SB-THREAD::%CONDITION-WAIT)) [cleanup]
3: ((FLET "WITHOUT-INTERRUPTS-BODY-3" :IN SB-THREAD::%CONDITION-WAIT))
4: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}> #<SB-THREAD:MUTEX "thread pool lock" free owner=0> :TIMEOUT NIL)
5: (BORDEAUX-THREADS:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}> #<SB-THREAD:MUTEX "thread pool lock" free owner=0> :TIMEOUT NIL)
Locals:
CONDITION-VARIABLE = #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}>
LOCK = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
TIMEOUT = NIL
6: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD))
[No Locals]
7: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
Locals:
GOT-IT = T
MUTEX = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
8: (SB-THREAD::CALL-WITH-MUTEX #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E51B}> #<SB-THREAD:MUTEX "thread pool lock" free own$
Locals:
GOT-IT = T
MUTEX = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
SB-C::THING = #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E51B}>
TIMEOUT = NIL
WAITP = T
9: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD))
[No Locals]
10: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
Locals:
GOT-IT = T
MUTEX = #<SB-THREAD:MUTEX "thread leader lock" contested owner=ChanL Thread Pool Worker>
11: (SB-THREAD::CALL-WITH-MUTEX #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E79B}> #<SB-THREAD:MUTEX "thread leader lock" contes$
Locals:
GOT-IT = T
MUTEX = #<SB-THREAD:MUTEX "thread leader lock" contested owner=ChanL Thread Pool Worker>
SB-C::THING = #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E79B}>
TIMEOUT = NIL
WAITP = T
12: ((LAMBDA NIL :IN CHANL::NEW-WORKER-THREAD))
Locals:
TASK = #<TASK 6:55:56 3P4 USDCAUD fee tracker [TERMINATED] {1006535A93}>
CHANL::THREAD-POOL = #<CHANL::THREAD-POOL {1003857FC3}>
13: ((LABELS BORDEAUX-THREADS::%BINDING-DEFAULT-SPECIALS-WRAPPER :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS))
Source locations for frames 12, 9, and 6:
(defun new-worker-thread (thread-pool &optional task)
(push (bt:make-thread
(lambda ()
(unwind-protect
(loop
(when task
(unwind-protect
(progn (setf (task-thread task) (current-thread))
(setf (task-status task) :alive)
(funcall (task-function task)))
(setf (task-thread task) nil)
(setf (task-status task) :terminated)
(bt:with-lock-held ((pool-lock thread-pool))
(setf (pool-tasks thread-pool)
(remove task (pool-tasks thread-pool))))))
(bt:with-lock-held ((pool-lock thread-pool))
(if (and (pool-soft-limit thread-pool)
(> (length (pool-threads thread-pool))
(pool-soft-limit thread-pool)))
(return)
(incf (free-thread-counter thread-pool))))
(bt:with-lock-held ((pool-leader-lock thread-pool)) ; [12]
(bt:with-lock-held ((pool-lock thread-pool)) ; [9]
(setf task
(loop until (pool-pending-tasks thread-pool)
do (bt:condition-wait (pool-leader-notifier thread-pool) ; [6]
(pool-lock thread-pool))
finally (return (pop (pool-pending-tasks thread-pool)))))
(decf (free-thread-counter thread-pool)))))
(bt:with-lock-held ((pool-lock thread-pool))
(setf (pool-threads thread-pool)
(remove (bt:current-thread) (pool-threads thread-pool))))))
:name "ChanL Thread Pool Worker")
(pool-threads thread-pool)))
System (all systems where this has been observed):
- Hardware: X86-64 AMD EPYC 7543 32-Core Processor
- OS: Arch Linux, Debian
- Compiler: SBCL
- Version: multiple; most recently, 2.2.9.debian
Test Suite Regrettably, the test suite does not detect this deadlock. I'm considering whether it is worth including a sufficiently vigorous test case that the deadlock occurs with high probability.
#32 and 4cbbcda begin fixing this
I strongly suspect this is an SBCL bug, rather than a ChanL bug; specifically, although the thread pool uses two locks, they are always taken in the same order -- by threads running the same code! -- and thus deadlock should not be possible. Some complication from the inner lock's release and reacquisition being triggered by a condition variable is probably at fault for the appearant and unintended deadlocks.
I wrote above "apparant and unintended", to slightly shift blame away from SBCL's deadlock detector. Deadlocks definitely do occur.
EDIT: see the printed representation of the deadlock cycle, in the bug report; owner=0 suggests to me that, by the time the cycle is printed, there is not a deadlock anymore, and thus the actual deadlock is caused by some race condition between the condition variables and mutexes provided by SBCL.
There is progress, it is not quite as hopeless as "Stop using SBCL"