slime icon indicating copy to clipboard operation
slime copied to clipboard

recursive error from FORCE-OUTPUT on SLIME-OUTPUT-STREAM with *BREAK-ON-SIGNALS* set to T

Open 3b opened this issue 4 years ago • 8 comments

on: sbcl 2.1.3.122.master.5-177bc6b03, x8664 windows GNU Emacs 27.1 (build 1, x86_64-w64-mingw32) of 2020-11-19 slime-v2.26.1

(let ((*break-on-signals* t))
  (loop repeat 500
        ;; random chars so emacs has to look for fonts, slowing it
        ;; down enough for deadline in force-output to trigger
        do (format t "~c" (code-char (random 1000000)))
           (force-output)))

dies fairly quickly with Help! 11 nested errors. SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded. (backtrace below)

It looks like defmethod stream-force-output :around ((stream slime-output-stream)) sets a 0.1 second deadline, then (indirectly) calls the output function from swank-repl::make-output-function, which calls send-to-emacs, which sometimes calls ping-pong, which calls receive-if, which grabs a mutex and does condition-wait.

If that wait takes long enough for the deadline to expire, sbcl tries to SIGNAL a DEADLINE-TIMEOUT, and then *BREAK-ON-SIGNALS* tries to invoke the slime debugger, which eventually calls RECEIVE-IF again, which tries to grab the same mutex, which signals a recursive lock error, starting a debugger loop until sbcl gives up.

Binding *break-on-signals* to NIL around the call to sb-thread:condition-wait in receive-if seems to avoid the problem, but no idea if that is the best (or even a valid) solution.

backtrace :

Help! 11 nested errors. SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.
Backtrace for: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}>
0: ((FLET SB-IMPL::TRY-TO-INVOKE-DEBUGGER :IN SB-IMPL::ERROR-ERROR))
1: ((FLET "THUNK" :IN SB-IMPL::ERROR-ERROR))
2: (SB-IMPL::%WITH-STANDARD-IO-SYNTAX #<FUNCTION (FLET "THUNK" :IN SB-IMPL::ERROR-ERROR) {5429CCB}>)
3: (SB-IMPL::ERROR-ERROR "Help! " 11 " nested errors. " "SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.")
4: (ERROR "Recursive lock attempt ~S." #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}>)
5: (SB-THREAD::%TRY-MUTEX #<unavailable argument> #<unavailable argument>)
6: (SB-THREAD:GRAB-MUTEX #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}> :WAITP T :TIMEOUT NIL)
7: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
8: ((FLET SWANK/BACKEND:RECEIVE-IF) #<FUNCTION (LAMBDA (SWANK::E) :IN SWANK::WAIT-FOR-EVENT) {1038A28FFB}> NIL)
9: (SWANK::WAIT-FOR-EVENT (OR (:EMACS-REX . SWANK::_) (:SLDB-RETURN 11)) NIL)
10: (SWANK::SLDB-LOOP 10)

11: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGING-ENVIRONMENT :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION (LAMBDA NIL :IN SWANK::DEBUG-IN-EMACS) {102107F93B}>)
12: (SWANK::DEBUG-IN-EMACS #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}>)
13: (SWANK:INVOKE-SLIME-DEBUGGER #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}>)
14: (SWANK/SBCL::CALL-WITH-BREAK-HOOK #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK:SWANK-DEBUGGER-HOOK) {10388894CB}>)
15: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGER-HOOK :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK:SWANK-DEBUGGER-HOOK) {10388894CB}>)
16: (SWANK:SWANK-DEBUGGER-HOOK #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}> #<unused argument>)
17: (SB-DEBUG::RUN-HOOK *DEBUGGER-HOOK* #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}>)
18: (INVOKE-DEBUGGER #<SIMPLE-ERROR "Recursive lock attempt ~S." {1038889033}>)
19: (ERROR "Recursive lock attempt ~S." #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}>)
20: (SB-THREAD::%TRY-MUTEX #<unavailable argument> #<unavailable argument>)
21: (SB-THREAD:GRAB-MUTEX #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}> :WAITP T :TIMEOUT NIL)
22: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
23: ((FLET SWANK/BACKEND:RECEIVE-IF) #<FUNCTION (LAMBDA (SWANK::E) :IN SWANK::WAIT-FOR-EVENT) {1038888FFB}> NIL)
24: (SWANK::WAIT-FOR-EVENT (OR (:EMACS-REX . SWANK::_) (:SLDB-RETURN 10)) NIL)
25: (SWANK::SLDB-LOOP 9)

;;; more sldb loops elided

146: (SWANK::SLDB-LOOP 1)
147: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGING-ENVIRONMENT :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION (LAMBDA NIL :IN SWANK::DEBUG-IN-EMACS) {102107F93B}>)
148: (SWANK::DEBUG-IN-EMACS #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
                    (now rebound to NIL)." {10378BC8A3}>)
149: (SWANK:INVOKE-SLIME-DEBUGGER #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
                    (now rebound to NIL)." {10378BC8A3}>)
150: (SWANK/SBCL::CALL-WITH-BREAK-HOOK #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK:SWANK-DEBUGGER-HOOK) {10378BD62B}>)
151: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGER-HOOK :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK:SWANK-DEBUGGER-HOOK) {10378BD62B}>)
152: (SWANK:SWANK-DEBUGGER-HOOK #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
                    (now rebound to NIL)." {10378BC8A3}> #<unused argument>)
153: (SB-DEBUG::RUN-HOOK *INVOKE-DEBUGGER-HOOK* #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
                    (now rebound to NIL)." {10378BC8A3}>)
154: (INVOKE-DEBUGGER #<SIMPLE-CONDITION "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
                    (now rebound to NIL)." {10378BC8A3}>)
155: (SB-INT:%BREAK BREAK "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
                    (now rebound to NIL)." #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
156: (BREAK "~A~%BREAK was entered because of *BREAK-ON-SIGNALS* ~
                    (now rebound to NIL)." #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
157: (SB-KERNEL::MAYBE-BREAK-ON-SIGNAL #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
158: (SB-KERNEL::%SIGNAL #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
159: (ERROR #<SB-SYS:DEADLINE-TIMEOUT {10378BC763}>)
160: (SB-KERNEL:WITH-SIMPLE-CONDITION-RESTARTS ERROR NIL SB-SYS:DEADLINE-TIMEOUT :SECONDS 0.1)
161: ((FLET SB-UNIX::BODY :IN SB-SYS:SIGNAL-DEADLINE))
162: (SB-SYS:SIGNAL-DEADLINE)
163: ((FLET "CLEANUP-FUN-8" :IN SB-THREAD::%CONDITION-WAIT)) [cleanup]
164: ((FLET "WITHOUT-INTERRUPTS-BODY-3" :IN SB-THREAD::%CONDITION-WAIT))
165: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE  {10377FFF03}> #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}> :TIMEOUT NIL)
166: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
167: (SB-THREAD::CALL-WITH-MUTEX #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SWANK/BACKEND:RECEIVE-IF) {542EC4B}> #<SB-THREAD:MUTEX owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> {10377FFED3}> T NIL)
168: ((FLET SWANK/BACKEND:RECEIVE-IF) #<FUNCTION (LAMBDA (SWANK::E) :IN SWANK::WAIT-FOR-EVENT) {10378BC6DB}> NIL)
169: (SWANK::WAIT-FOR-EVENT (:EMACS-PONG 102) NIL)
170: (SWANK::SEND-TO-EMACS (:WRITE-STRING "򓄳"))
171: ((LAMBDA NIL :IN STREAM-FINISH-OUTPUT))
172: ((FLET "WITHOUT-INTERRUPTS-BODY-12" :IN SB-THREAD::CALL-WITH-RECURSIVE-LOCK))
173: (SB-THREAD::CALL-WITH-RECURSIVE-LOCK #<FUNCTION (FLET SB-THREAD::WITH-RECURSIVE-LOCK-THUNK :IN SWANK/BACKEND:CALL-WITH-LOCK-HELD) {542EF0B}> #<SB-THREAD:MUTEX "buffer write lock" owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}>> T NIL)
174: ((FLET SWANK/BACKEND:CALL-WITH-LOCK-HELD :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<SB-THREAD:MUTEX "buffer write lock" owner: #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}>> #<FUNCTION (LAMBDA NIL :IN STREAM-FINISH-OUTPUT) {10378BC4DB}>)
175: ((:METHOD STREAM-FINISH-OUTPUT (SWANK/GRAY::SLIME-OUTPUT-STREAM)) #<SWANK/GRAY::SLIME-OUTPUT-STREAM {10352C7AC3}>) [fast-method]
176: ((LABELS SB-IMPL::BIND-DEADLINE-AND-CALL :IN STREAM-FORCE-OUTPUT) #S(SB-IMPL::DEADLINE :INTERNAL-TIME 509573351 :SECONDS 0.1))
177: ((:METHOD STREAM-FORCE-OUTPUT :AROUND (SWANK/GRAY::SLIME-OUTPUT-STREAM)) #<SWANK/GRAY::SLIME-OUTPUT-STREAM {10352C7AC3}>) [fast-method]
178: (FORCE-OUTPUT #<SYNONYM-STREAM :SYMBOL SWANK::*CURRENT-STANDARD-OUTPUT* {10352D8F43}>)
179: ((LAMBDA ()))
180: (SB-INT:SIMPLE-EVAL-IN-LEXENV (LET ((*BREAK-ON-SIGNALS* T)) (LOOP REPEAT 10000 DO (FORMAT T "~c" (CODE-CHAR (RANDOM 1000000))) (FORCE-OUTPUT))) #<NULL-LEXENV>)
181: (EVAL (LET ((*BREAK-ON-SIGNALS* T)) (LOOP REPEAT 10000 DO (FORMAT T "~c" (CODE-CHAR (RANDOM 1000000))) (FORCE-OUTPUT))))
182: ((LAMBDA NIL :IN SWANK:INTERACTIVE-EVAL))
183: (SWANK::CALL-WITH-RETRY-RESTART "Retry SLIME interactive evaluation request." #<FUNCTION (LAMBDA NIL :IN SWANK:INTERACTIVE-EVAL) {10378500BB}>)
184: (SWANK::CALL-WITH-BUFFER-SYNTAX NIL #<FUNCTION (LAMBDA NIL :IN SWANK:INTERACTIVE-EVAL) {103785009B}>)
185: (SB-INT:SIMPLE-EVAL-IN-LEXENV (SWANK:INTERACTIVE-EVAL #<(SIMPLE-ARRAY CHARACTER (264)) (let ((*break-on-signals* t))
  (loop repeat 10000
        ;; random chars so emacs has to look for fonts, slowing it
        ;; down enough for deadline in force-output to trigger
        do (format ... {103784845F}>) #<NULL-LEXENV>)
186: (EVAL (SWANK:INTERACTIVE-EVAL #<(SIMPLE-ARRAY CHARACTER (264)) (let ((*break-on-signals* t))
  (loop repeat 10000
        ;; random chars so emacs has to look for fonts, slowing it
        ;; down enough for deadline in force-output to trigger
        do (format ... {103784845F}>))
187: (SWANK:EVAL-FOR-EMACS (SWANK:INTERACTIVE-EVAL #<(SIMPLE-ARRAY CHARACTER (264)) (let ((*break-on-signals* t))
  (loop repeat 10000
        ;; random chars so emacs has to look for fonts, slowing it
        ;; down enough for deadline in force-output to trigger
        do (format ... {103784845F}>) "COMMON-LISP-USER" 73)
188: ((LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD))
189: (SWANK/SBCL::CALL-WITH-BREAK-HOOK #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD) {1021035F8B}>)
190: ((FLET SWANK/BACKEND:CALL-WITH-DEBUGGER-HOOK :IN "c:/msys64/home/b/quicklisp/dists/quicklisp/software/slime-v2.26.1/swank/sbcl.lisp") #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD) {1021035F8B}>)
191: (SWANK::CALL-WITH-BINDINGS ((*STANDARD-INPUT* . #<SWANK/GRAY::SLIME-INPUT-STREAM {10351B5B13}>)) #<FUNCTION (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD) {102103624B}>)
192: ((LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD))
193: ((FLET SB-UNIX::BODY :IN SB-THREAD::RUN))
194: ((FLET "WITHOUT-INTERRUPTS-BODY-9" :IN SB-THREAD::RUN))
195: (SB-THREAD::RUN #<SB-THREAD:THREAD "worker" RUNNING {10376BEAF3}> NIL #<FUNCTION (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD) {10376BEA0B}> NIL)
196: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::START-THREAD))
197: ((FLET SB-THREAD::START-ROUTINE :IN SB-THREAD::START-THREAD))
198: ("foreign function: call_into_lisp")
199: ("foreign function: funcall0")

3b avatar May 02 '21 23:05 3b

Thanks for the report and analysis. IIRC, there have been some improvements re SBCL's world lock. I wonder if the workaround is still necessary?

luismbo avatar May 04 '21 20:05 luismbo

I doubt SBCL's world lock is involved. I think the only thing SBCL-specific is the presence of deadlines and mutexes, which aren't all that implementation specific.

The problem is that slime is non-reentrant at that part of RECEIVE-IF due to holding a mutex, and the combination of deadlines + *BREAK-ON-SIGNALS* causes it to try to re-enter that piece of code.

Possibly using a recursive lock there would also work, and/or possibly it just shouldn't use deadlines to turn finish-output into force-output.

User code possibly should also be more specific than T for the type of conditions to break on, but it would still be nice if the result was just having to hit C in a few extra debugger windows rather than breaking things completely.

3b avatar May 08 '21 18:05 3b

I mention the world-lock because that's the stated rationale for that deadline/mutex scheme:

#+(and sbcl sb-thread)
(defmethod stream-force-output :around ((stream slime-output-stream))
  ;; Workaround for deadlocks between the world-lock and auto-flush-thread
  ;; buffer write lock.
  ;;
  ;; Another alternative would be to grab the world-lock here, but that's less
  ;; future-proof, and could introduce other lock-ordering issues in the
  ;; future.
  (handler-case
      (sb-sys:with-deadline (:seconds 0.1)
        (call-next-method))
    (sb-sys:deadline-timeout ()
      nil)))```

luismbo avatar May 09 '21 20:05 luismbo

ah, missed that comment, that's possible.

3b avatar May 09 '21 22:05 3b

@3b interrupting the output function (and the buffer flushing method that updates the fill pointer after calling the output function) with a deadline seems like a bad idea. All sorts of things could go wrong there.

luismbo avatar May 10 '21 10:05 luismbo

possibly it just shouldn't use deadlines to turn finish-output into force-output.

Do you mean that stream-force-output could simply call maybe-schedule-flush? It'd need to grab the buffer write lock, so it'd trip on the same issue described in 1f8d15f7e117e55a3b558d741f7f56c817820f51.

  Occasionally auto-flush thread could deadlock with a thread doing
  compilation:

    T1: grabs world-lock
    T2: decides to flush, grabs stream lock
    T1: needs to do output (compiler note, etc), waits on the stream lock
    T2: needs the world-lock for a while (PCL dispatch machinery setup, etc)

    *boink*

Which brings us to my main question, can we reproduce this deadlock?

I've locally removed the :around stream-force-output method and tried to reproduce the deadlock using:

(values
  (sb-thread:make-thread (lambda () (loop (swank/gray::maybe-schedule-flush swank::*current-standard-output*))))
  (sb-thread:make-thread (lambda () (loop (compile-file "/path/to/slime/swank/gray.lisp")))))

Picked gray.lisp with the vague idea that it'd invalidate CLOS caches relevant to stream-force-output. No deadlock there.

Tried something more to the point:

(values
 (sb-thread:make-thread
  (lambda ()
    (sb-kernel:with-world-lock ()
      (sleep 5)
      (write-line "T1: hello")
      (force-output))))
 (sb-thread:make-thread
  (lambda ()
    (sleep 1)
    (write-line "T2: hello"))))
#<SB-THREAD:THREAD RUNNING {1008B62C53}>
#<SB-THREAD:THREAD RUNNING {1008B8C6D3}>
T2: hello
T1: hello

What happens there is:

  1. T1: grabs world lock and sleeps for 5 seconds.
  2. T2: after sleeping for 1 second, writes "T2: hello" which triggers a message to the auto-flush-thread.
  3. auto-flush-thread: gets the message, sleeps for 80 ms, calls force-output which doesn't grab the world lock, therefore succeeds in flushing the output to Emacs.
  4. T1: writes to the stream and explicitly forces output, which grabs the stream lock and would deadlock if auto-flush-thread were waiting on the world lock.

Even if auto-flush-thread did grab the world lock (as it calls a bunch of generic functions before forcing output), it would only cause a deadlock if it tried to grab the world lock after it grabs the stream lock in stream-finish-output, at which point I don't see any CLOS being used (but maybe I didn't look hard enough).

I tried various random things like invoking reinitialize-instance on methods that auto-flush-thread uses, but couldn't get it to grab the world lock.

Therefore, I'm inclined to remove the :around method. What do you think?

luismbo avatar May 11 '21 14:05 luismbo

Do you mean that stream-force-output could simply call maybe-schedule-flush? It'd need to grab the buffer write lock, so it'd trip on the same issue described in 1f8d15f.

I don't mean anything that specific, since I have no idea what the code is doing internally :)

It just seems to me that force-output is conceptually "make sure any previously sent output gets scheduled for display" and finish-output is "do that and then wait for it to finish", so implementing force-output as "do that and then wait for it to finish but maybe interrupt the waiting" instead of just "do that" seems to indicate something wrong somewhere.

I tried various random things like invoking reinitialize-instance on methods that auto-flush-thread uses, but couldn't get it to grab the world lock.

Therefore, I'm inclined to remove the :around method. What do you think?

I'd probably at least try running on older SBCL first, and see if any of the tests can trigger problems there. If so, removing it sounds reasonable, though I'd probably still leave it enabled for old SBCL. Maybe also check other combinations of connection/stream settings.

3b avatar May 11 '21 20:05 3b

I'd probably at least try running on older SBCL first, and see if any of the tests can trigger problems there. If so, removing it sounds reasonable, though I'd probably still leave it enabled for old SBCL. Maybe also check other combinations of connection/stream settings.

That sounds sensible, yes.

luismbo avatar May 12 '21 07:05 luismbo