recursive error from FORCE-OUTPUT on SLIME-OUTPUT-STREAM with *BREAK-ON-SIGNALS* set to T
on:
sbcl 2.1.3.122.master.5-177bc6b03, x8664 windows
GNU Emacs 27.1 (build 1, x86_64-w64-mingw32) of 2020-11-19
slime-v2.26.1
(let ((*break-on-signals* t))
(loop repeat 500
;; random chars so emacs has to look for fonts, slowing it
;; down enough for deadline in force-output to trigger
do (format t "~c" (code-char (random 1000000)))
(force-output)))
dies fairly quickly with Help! 11 nested errors. SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded. (backtrace below)
It looks like defmethod stream-force-output :around ((stream slime-output-stream))
sets a 0.1 second deadline, then (indirectly) calls the output function from swank-repl::make-output-function, which calls send-to-emacs, which sometimes calls ping-pong, which calls receive-if, which grabs a mutex and does condition-wait.
If that wait takes long enough for the deadline to expire, sbcl tries to SIGNAL a DEADLINE-TIMEOUT, and then *BREAK-ON-SIGNALS* tries to invoke the slime debugger, which eventually calls RECEIVE-IF again, which tries to grab the same mutex, which signals a recursive lock error, starting a debugger loop until sbcl gives up.
Binding *break-on-signals* to NIL around the call to sb-thread:condition-wait in receive-if seems to avoid the problem, but no idea if that is the best (or even a valid) solution.
backtrace :
Help! 11 nested errors. SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.
Backtrace for: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}>
0: ((FLET SB-IMPL::TRY-TO-INVOKE-DEBUGGER :IN SB-IMPL::ERROR-ERROR))
1: ((FLET "THUNK" :IN SB-IMPL::ERROR-ERROR))
2: (SB-IMPL::%WITH-STANDARD-IO-SYNTAX #<FUNCTION (FLET "THUNK" :IN SB-IMPL::ERROR-ERROR) {5429CCB}>)
3: (SB-IMPL::ERROR-ERROR "Help! " 11 " nested errors. " "SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.")
4: (ERROR "Recursive lock attempt ~S." #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}>)
5: (SB-THREAD::%TRY-MUTEX #<unavailable argument> #<unavailable argument>)
6: (SB-THREAD:GRAB-MUTEX #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}> :WAITP T :TIMEOUT NIL)
7: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
8: ((FLET SWANK/BACKEND:RECEIVE-IF) #<FUNCTION (LAMBDA (SWANK::E) :IN SWANK::WAIT-FOR-EVENT) {1038A28FFB}> NIL)
9: (SWANK::WAIT-FOR-EVENT (OR (:EMACS-REX . SWANK::_) (:SLDB-RETURN 11)) NIL)
10: (SWANK::SLDB-LOOP 10)
11: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGING-ENVIRONMENT :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION (LAMBDA NIL :IN SWANK::DEBUG-IN-EMACS) {102107F93B}>)
12: (SWANK::DEBUG-IN-EMACS #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}>)
13: (SWANK:INVOKE-SLIME-DEBUGGER #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}>)
14: (SWANK/SBCL::CALL-WITH-BREAK-HOOK #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK:SWANK-DEBUGGER-HOOK) {10388894CB}>)
15: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGER-HOOK :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK:SWANK-DEBUGGER-HOOK) {10388894CB}>)
16: (SWANK:SWANK-DEBUGGER-HOOK #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}> #<unused argument>)
17: (SB-DEBUG::RUN-HOOK *DEBUGGER-HOOK* #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}>)
18: (INVOKE-DEBUGGER #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}>)
19: (ERROR "Recursive lock attempt ~S." #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}>)
20: (SB-THREAD::%TRY-MUTEX #<unavailable argument> #<unavailable argument>)
21: (SB-THREAD:GRAB-MUTEX #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}> :WAITP T :TIMEOUT NIL)
22: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
23: ((FLET SWANK/BACKEND:RECEIVE-IF) #<FUNCTION (LAMBDA (SWANK::E) :IN SWANK::WAIT-FOR-EVENT) {1038888FFB}> NIL)
24: (SWANK::WAIT-FOR-EVENT (OR (:EMACS-REX . SWANK::_) (:SLDB-RETURN 10)) NIL)
25: (SWANK::SLDB-LOOP 9)
;;; more sldb loops elided
146: (SWANK::SLDB-LOOP 1)
147: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGING-ENVIRONMENT :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION (LAMBDA NIL :IN SWANK::DEBUG-IN-EMACS) {102107F93B}>)
148: (SWANK::DEBUG-IN-EMACS #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
(now rebound to NIL)." {10378BC8A3}>)
149: (SWANK:INVOKE-SLIME-DEBUGGER #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
(now rebound to NIL)." {10378BC8A3}>)
150: (SWANK/SBCL::CALL-WITH-BREAK-HOOK #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK:SWANK-DEBUGGER-HOOK) {10378BD62B}>)
151: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGER-HOOK :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK:SWANK-DEBUGGER-HOOK) {10378BD62B}>)
152: (SWANK:SWANK-DEBUGGER-HOOK #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
(now rebound to NIL)." {10378BC8A3}> #<unused argument>)
153: (SB-DEBUG::RUN-HOOK *INVOKE-DEBUGGER-HOOK* #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
(now rebound to NIL)." {10378BC8A3}>)
154: (INVOKE-DEBUGGER #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
(now rebound to NIL)." {10378BC8A3}>)
155: (SB-INT:%BREAK BREAK "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
(now rebound to NIL)." #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
156: (BREAK "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
(now rebound to NIL)." #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
157: (SB-KERNEL::MAYBE-BREAK-ON-SIGNAL #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
158: (SB-KERNEL::%SIGNAL #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
159: (ERROR #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
160: (SB-KERNEL:WITH-SIMPLE-CONDITION-RESTARTS ERROR NIL SB-SYS:DEADLINE-TIMEOUT :SECONDS 0.1)
161: ((FLET SB-UNIX::BODY :IN SB-SYS:SIGNAL-DEADLINE))
162: (SB-SYS:SIGNAL-DEADLINE)
163: ((FLET "CLEANUP-FUN-8" :IN SB-THREAD::%CONDITION-WAIT)) [cleanup]
164: ((FLET "WITHOUT-INTERRUPTS-BODY-3" :IN SB-THREAD::%CONDITION-WAIT))
165: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE {10377FFF03}> #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}> :TIMEOUT NIL)
166: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
167: (SB-THREAD::CALL-WITH-MUTEX #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SWANK/BACKEND:RECEIVE-IF) {542EC4B}> #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}> T NIL)
168: ((FLET SWANK/BACKEND:RECEIVE-IF) #<FUNCTION (LAMBDA (SWANK::E) :IN SWANK::WAIT-FOR-EVENT) {10378BC6DB}> NIL)
169: (SWANK::WAIT-FOR-EVENT (:EMACS-PONG 102) NIL)
170: (SWANK::SEND-TO-EMACS (:WRITE-STRING ""))
171: ((LAMBDA NIL :IN STREAM-FINISH-OUTPUT))
172: ((FLET "WITHOUT-INTERRUPTS-BODY-12" :IN SB-THREAD::CALL-WITH-RECURSIVE-LOCK))
173: (SB-THREAD::CALL-WITH-RECURSIVE-LOCK #<FUNCTION (FLET SB-THREAD::WITH-RECURSIVE-LOCK-THUNK :IN SWANK/BACKEND:CALL-WITH-LOCK-HELD) {542EF0B}> #<SB-THREAD:MUTEX "buffer write lock" owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}>> T NIL)
174: ((FLET SWANK/BACKEND:CALL-WITH-LOCK-HELD :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<SB-THREAD:MUTEX "buffer write lock" owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}>> #<FUNCTION (LAMBDA NIL :IN STREAM-FINISH-OUTPUT) {10378BC4DB}>)
175: ((:METHOD STREAM-FINISH-OUTPUT (SWANK/GRAY::SLIME-OUTPUT-STREAM)) #<SWANK/GRAY::SLIME-OUTPUT-STREAM {10352C7AC3}>) [fast-method]
176: ((LABELS SB-IMPL::BIND-DEADLINE-AND-CALL :IN STREAM-FORCE-OUTPUT) #S(SB-IMPL::DEADLINE :INTERNAL-TIME 509573351 :SECONDS 0.1))
177: ((:METHOD STREAM-FORCE-OUTPUT :AROUND (SWANK/GRAY::SLIME-OUTPUT-STREAM)) #<SWANK/GRAY::SLIME-OUTPUT-STREAM {10352C7AC3}>) [fast-method]
178: (FORCE-OUTPUT #<SYNONYM-STREAM :SYMBOL SWANK::*CURRENT-STANDARD-OUTPUT* {10352D8F43}>)
179: ((LAMBDA ()))
180: (SB-INT:SIMPLE-EVAL-IN-LEXENV (LET ((*BREAK-ON-SIGNALS* T)) (LOOP REPEAT 10000 DO (FORMAT T "~c" (CODE-CHAR (RANDOM 1000000))) (FORCE-OUTPUT))) #<NULL-LEXENV>)
181: (EVAL (LET ((*BREAK-ON-SIGNALS* T)) (LOOP REPEAT 10000 DO (FORMAT T "~c" (CODE-CHAR (RANDOM 1000000))) (FORCE-OUTPUT))))
182: ((LAMBDA NIL :IN SWANK:INTERACTIVE-EVAL))
183: (SWANK::CALL-WITH-RETRY-RESTART "Retry SLIME interactive evaluation request." #<FUNCTION (LAMBDA NIL :IN SWANK:INTERACTIVE-EVAL) {10378500BB}>)
184: (SWANK::CALL-WITH-BUFFER-SYNTAX NIL #<FUNCTION (LAMBDA NIL :IN SWANK:INTERACTIVE-EVAL) {103785009B}>)
185: (SB-INT:SIMPLE-EVAL-IN-LEXENV (SWANK:INTERACTIVE-EVAL #<(SIMPLE-ARRAY CHARACTER (264)) (let ((*break-on-signals* t))
(loop repeat 10000
;; random chars so emacs has to look for fonts, slowing it
;; down enough for deadline in force-output to trigger
do (format ... {103784845F}>) #<NULL-LEXENV>)
186: (EVAL (SWANK:INTERACTIVE-EVAL #<(SIMPLE-ARRAY CHARACTER (264)) (let ((*break-on-signals* t))
(loop repeat 10000
;; random chars so emacs has to look for fonts, slowing it
;; down enough for deadline in force-output to trigger
do (format ... {103784845F}>))
187: (SWANK:EVAL-FOR-EMACS (SWANK:INTERACTIVE-EVAL #<(SIMPLE-ARRAY CHARACTER (264)) (let ((*break-on-signals* t))
(loop repeat 10000
;; random chars so emacs has to look for fonts, slowing it
;; down enough for deadline in force-output to trigger
do (format ... {103784845F}>) "COMMON-LISP-USER" 73)
188: ((LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD))
189: (SWANK/SBCL::CALL-WITH-BREAK-HOOK #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD) {1021035F8B}>)
190: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGER-HOOK :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD) {1021035F8B}>)
191: (SWANK::CALL-WITH-BINDINGS ((*STANDARD-INPUT* . #<SWANK/GRAY::SLIME-INPUT-STREAM {10351B5B13}>)) #<FUNCTION (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD) {102103624B}>)
192: ((LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD))
193: ((FLET SB-UNIX::BODY :IN SB-THREAD::RUN))
194: ((FLET "WITHOUT-INTERRUPTS-BODY-9" :IN SB-THREAD::RUN))
195: (SB-THREAD::RUN #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> NIL #<FUNCTION (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD) {10376BEA0B}> NIL)
196: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::START-THREAD))
197: ((FLET SB-THREAD::START-ROUTINE :IN SB-THREAD::START-THREAD))
198: ("foreign function: call_into_lisp")
199: ("foreign function: funcall0")
Thanks for the report and analysis. IIRC, there have been some improvements re SBCL's world lock. I wonder if the workaround is still necessary?
I doubt SBCL's world lock is involved. I think the only thing SBCL-specific is the presence of deadlines and mutexes, which aren't all that implementation specific.
The problem is that slime is non-reentrant at that part of RECEIVE-IF due to holding a mutex, and the combination of deadlines + *BREAK-ON-SIGNALS* causes it to try to re-enter that piece of code.
Possibly using a recursive lock there would also work, and/or possibly it just shouldn't use deadlines to turn finish-output into force-output.
User code possibly should also be more specific than T for the type of conditions to break on, but it would still be nice if the result was just having to hit C in a few extra debugger windows rather than breaking things completely.
I mention the world-lock because that's the stated rationale for that deadline/mutex scheme:
#+(and sbcl sb-thread)
(defmethod stream-force-output :around ((stream slime-output-stream))
;; Workaround for deadlocks between the world-lock and auto-flush-thread
;; buffer write lock.
;;
;; Another alternative would be to grab the world-lock here, but that's less
;; future-proof, and could introduce other lock-ordering issues in the
;; future.
(handler-case
(sb-sys:with-deadline (:seconds 0.1)
(call-next-method))
(sb-sys:deadline-timeout ()
nil)))```
ah, missed that comment, that's possible.
@3b interrupting the output function (and the buffer flushing method that updates the fill pointer after calling the output function) with a deadline seems like a bad idea. All sorts of things could go wrong there.
possibly it just shouldn't use deadlines to turn
finish-outputintoforce-output.
Do you mean that stream-force-output could simply call maybe-schedule-flush? It'd need to grab the buffer write lock, so it'd trip on the same issue described in 1f8d15f7e117e55a3b558d741f7f56c817820f51.
Occasionally auto-flush thread could deadlock with a thread doing
compilation:
T1: grabs world-lock
T2: decides to flush, grabs stream lock
T1: needs to do output (compiler note, etc), waits on the stream lock
T2: needs the world-lock for a while (PCL dispatch machinery setup, etc)
*boink*
Which brings us to my main question, can we reproduce this deadlock?
I've locally removed the :around stream-force-output method and tried to reproduce the deadlock using:
(values
(sb-thread:make-thread (lambda () (loop (swank/gray::maybe-schedule-flush swank::*current-standard-output*))))
(sb-thread:make-thread (lambda () (loop (compile-file "/path/to/slime/swank/gray.lisp")))))
Picked gray.lisp with the vague idea that it'd invalidate CLOS caches relevant to stream-force-output. No deadlock there.
Tried something more to the point:
(values
(sb-thread:make-thread
(lambda ()
(sb-kernel:with-world-lock ()
(sleep 5)
(write-line "T1: hello")
(force-output))))
(sb-thread:make-thread
(lambda ()
(sleep 1)
(write-line "T2: hello"))))
#<SB-THREAD:THREAD RUNNING {1008B62C53}>
#<SB-THREAD:THREAD RUNNING {1008B8C6D3}>
T2: hello
T1: hello
What happens there is:
- T1: grabs world lock and sleeps for 5 seconds.
- T2: after sleeping for 1 second, writes "T2: hello" which triggers a message to the auto-flush-thread.
- auto-flush-thread: gets the message, sleeps for 80 ms, calls force-output which doesn't grab the world lock, therefore succeeds in flushing the output to Emacs.
- T1: writes to the stream and explicitly forces output, which grabs the stream lock and would deadlock if auto-flush-thread were waiting on the world lock.
Even if auto-flush-thread did grab the world lock (as it calls a bunch of generic functions before forcing output), it would only cause a deadlock if it tried to grab the world lock after it grabs the stream lock in stream-finish-output, at which point I don't see any CLOS being used (but maybe I didn't look hard enough).
I tried various random things like invoking reinitialize-instance on methods that auto-flush-thread uses, but couldn't get it to grab the world lock.
Therefore, I'm inclined to remove the :around method. What do you think?
Do you mean that
stream-force-outputcould simply call maybe-schedule-flush? It'd need to grab the buffer write lock, so it'd trip on the same issue described in 1f8d15f.
I don't mean anything that specific, since I have no idea what the code is doing internally :)
It just seems to me that force-output is conceptually "make sure any previously sent output gets scheduled for display" and finish-output is "do that and then wait for it to finish", so implementing force-output as "do that and then wait for it to finish but maybe interrupt the waiting" instead of just "do that" seems to indicate something wrong somewhere.
I tried various random things like invoking
reinitialize-instanceon methods that auto-flush-thread uses, but couldn't get it to grab the world lock.Therefore, I'm inclined to remove the
:aroundmethod. What do you think?
I'd probably at least try running on older SBCL first, and see if any of the tests can trigger problems there. If so, removing it sounds reasonable, though I'd probably still leave it enabled for old SBCL. Maybe also check other combinations of connection/stream settings.
I'd probably at least try running on older SBCL first, and see if any of the tests can trigger problems there. If so, removing it sounds reasonable, though I'd probably still leave it enabled for old SBCL. Maybe also check other combinations of connection/stream settings.
That sounds sensible, yes.