ccl
ccl copied to clipboard
'Waiting for recursive lock' crash (CCL1.11, CCL1.12 on ARM32)
Copy this code to /tmp/bug.lisp.
(in-package :cl-user)
(ql:quickload :bordeaux-threads)
(shadow 'terminate)
(ql:quickload :iterate)
(use-package :iterate)
(defun worker-read-byte (&aux file byte)
;; sharing can be (:private :lock :external)
(setf file (open "/tmp/bug.lisp" :element-type '(unsigned-byte 8)))
(setf byte (read-byte file))
(close file)
byte)
(defun main (n)
(let (threads)
(setf threads
(iter
(for i :from 1 :to n)
(for thread = (bt:make-thread #'(lambda () (worker-read-byte)) :name "tracking"))
(collect thread)))
(write
(iter
(for thread :in threads)
(collect (bt:join-thread thread))))
(terpri)))
(iter
(main 16)
(sleep 1))
After several iterations crashes:
Unhandled exception 4 at 0x105901f8, context->regs at #x75fe3838
? for help
[30173] Clozure CL kernel debugger:
Works fine on amd64 (but I need ARM32). If remove work with files then works fine, but I need a lot of streaming.
Didn't use WITH-OPEN-FILE to check is it argument :ABORT of function CLOSE. It can be NIL.
Tried every of (:private :lock (:external nil)) values to argument :sharing of function OPEN, because sometimes (with/out explicit JOIN-PROCESS) condition arises:
> Error: Current process #<PROCESS tracking(335) [Active] #x14E7FAF6> does not own lock #<RECURSIVE-LOCK [ptr @ #x12F7E20] #x140CF4D6>
> While executing: CCL::%UNLOCK-RECURSIVE-LOCK-OBJECT, in process tracking(335).
If I specify N different file names -- bug preserves.
If I enclose entire worker into lock -- bug preserves.
This sounds like a bug that we've struggled with on 32-bit ARM for a long time.
[from the mailing list]
A test case is here: https://github.com/jetmonk/openmcl-thread-test
Running (threadtest2) triggers a modified with-lock macro that grabs a lock, checks that the current process in fact owns it, runs body, and again checks for ownership before releasing. After many iterations, an error is thrown when the macro discovers that some other process stole the lock.
I thought it might be a lack of ARM DMB instructions ("ensures that all memory accesses are completed before new memory access is committed.") in atomic routines , which have been discussed as necessary for ARM mutexes elsewhere online, like
https://stackoverflow.com/questions/42698277/arm-cortex-m4-mutex-lock-dmb-instruction
Apparently the issue is different cores not realizing another core has changed memory until (DMB) forces synchronization.
But I tried stuffing (dmb) everywhere in ARM/arm-misc.lisp and it didn't fix the problem.
[From a folllowup post] Your use of "taskset" isn't right. You don't need "sudo", and you do need the "-a" option when you use it on a CCL instance that's already running. Alternatively, you can pin a CCL instance to a single core before it runs, by starting it with a command like "taskset 2 ccl".
On my pine64 device, running CCL 1.12.1 on Debian arm64 "testing", your THREADTEST2 crashes if I run CCL the normal way, but completes successfully if I run CCL on a single core via "taskset 2 ccl".
This is probably not a synchronization issue as we're experiencing the crashes on a single core arm32 system. Last known good version is 11.5.
@varjagg are you reporting that ccl 1.11.5 works OK, and you see the issue on later versions?
@xrme yep we've been testing our system on 11.5 for ages but as we tried to move to 1.12.2 it started to melt down. Same for any release between the two. At first we suspected our code but yesterday it fell apart just in swank interaction on a fresh image. But now that I reproduced it again am not even sure it's the same exact issue. The "Unhandled exception 4 at…" part is identical but it blows up on multiplication now:
? Unhandled exception 4 at 0x1a068, context->regs at #x75148258
? for help
[4097] Clozure CL kernel debugger: t
Current Thread Context Record (tcr) = 0x7693bf18
Control (C) stack area: low = 0x74fe6000, high = 0x7514a000
Value (lisp) stack area: low = 0x74eca000, high = 0x74fe4000
[4097] Clozure CL kernel debugger: b
current thread: tcr = 0x7693bf18, native thread ID = 0x12ef, interrupts enabled
(#x75148528) #x10024664 : #<Function MULTIPLY-BIGNUMS #x1400d50e> + 312
(#x75148538) #x108E3D34 : #<Function CURRENT-TIME-IN-NANOSECONDS #x143f0e3e> + 2
64
Then current-time-in-nanoseconds
is called from another function compiled with (optimize (speed 3) (space 0) (safety 1))
and inlined elsewhere. Though I believe this code does work most of the time so it's hardly a way to reproduce it.
Another backtrace. Now from socket code:
Unhandled exception 4 at 0x1a068, context->regs at #x745c3288 [820/1835]
? for help
[6539] Clozure CL kernel debugger: t
Current Thread Context Record (tcr) = 0x76505f40
Control (C) stack area: low = 0x74461000, high = 0x745c5000
Value (lisp) stack area: low = 0x73905000, high = 0x73a1f000
Exception stack pointer = 0x745c3558
[6539] Clozure CL kernel debugger: b current thread: tcr = 0x76505f40, native thread ID = 0x2034, interrupts disabled
(#x745C3558) #x10071BF4 : #<Function %CONS-NHASH-VECTOR #x14021376> + 176
(#x745C3568) #x100603B0 : #<Function %LOCK-FREE-REHASH #x1401e066> + 420
(#x745C3578) #x1005FE4C : #<Function LOCK-FREE-REHASH #x1401de6e> + 224
(#x745C3588) #x10062A28 : #<Function LOCK-FREE-PUTHASH #x1401e8ee> + 980
(#x745C3598) #x14D40A5C : #<Function ADD-WAITER #x14d40b1e> + 112
(#x745C35A8) #x14D40EEC : #<Function MAKE-WAIT-LIST #x14d40f1e> + 224 (#x745C35B8) #x14D3ED20 : #<Function WAIT-FOR-INPUT #x14d3f15e> + 340
(#x745C35C8) #x00018CAC : (subprimitive ret1valn)
…however, one of other threads appears to have bad frame:
TCR = 0x19d6d10, cstack area #x19eace0, native thread ID = 0x2033, interrupts enabled
Bad frame! (0x74333600)
I don't remember what ARM-specific changes took place from 1.11.5 to newer releases. Clearly something must have happened, though.
I wonder if there's some runtime (gc, exception handling) issue.
Signal 4 is SIGILL, by the way.
SIGILL? What instruction is the CPU executing at that moment? Is it possible to disassemble whatever is at the program counter at the time of the signal?
SIGILL is frequently used normally for GC and various other traps, so it's not surprising to get it.
The lisp kernel debugger can print out the registers (with "r"). We'd need to attach a debugger to find out more.
This is farily reproducible on our target so I'll get the register values tomorrow. Is there anything specific to instrument armcl for debugger?
If you end up in the lisp kernel debugger, you should be able to attach gdb.
https://trac.clozure.com/ccl/wiki/CclUnderGdb might help a little.
I think armcl is already compiled with -g
, but if it isn't, edit lisp-kernel/linuxarm/Makefile and rebuild the lisp kernel.
#454 is a bug on the ARM where calling (#_getpid)
from the top level crashes the lisp hard. But that happens on 1.11.5 too, so that may not be relevant here.
OK, so I poked around a bit. Not sure how helpful is it but there it goes:
Unhandled exception 4 at 0x1a068, context->regs at #x74ab3288
? for help
[31507] Clozure CL kernel debugger: t
Current Thread Context Record (tcr) = 0x76514c78
Control (C) stack area: low = 0x74951000, high = 0x74ab5000
Value (lisp) stack area: low = 0x74315000, high = 0x7442f000
Exception stack pointer = 0x74ab3558
[31507] Clozure CL kernel debugger: r
r00 = 0x0001544A r08 = 0x1401CE3E
r01 = 0x00000002 r09 = 0x00000033
r02 = 0x00000552 r10 = 0x7442EE7C
r03 = 0x76514C78 r11 = 0x00000000
r04 = 0x00000128 r12 = 0xFFFFFFF8
r05 = 0x00000550 r13 = 0x74AB3558
r06 = 0x00000550 r14 = 0x0001A37C
r07 = 0x18120000 r15 = 0x0001A068
[31507] Clozure CL kernel debugger: b
current thread: tcr = 0x76514c78, native thread ID = 0x498, interrupts disabled
(#x74AB3558) #x10071BF4 : #<Function %CONS-NHASH-VECTOR #x14021376> + 176
(#x74AB3568) #x100603B0 : #<Function %LOCK-FREE-REHASH #x1401e066> + 420
(#x74AB3578) #x1005FE4C : #<Function LOCK-FREE-REHASH #x1401de6e> + 224
(#x74AB3588) #x10062A28 : #<Function LOCK-FREE-PUTHASH #x1401e8ee> + 980
(#x74AB3598) #x14D34E7C : #<Function ADD-WAITER #x14d34f3e> + 112
(#x74AB35A8) #x14D3530C : #<Function MAKE-WAIT-LIST #x14d3533e> + 224
(#x74AB35B8) #x14D33140 : #<Function WAIT-FOR-INPUT #x14d3357e> + 340
(#x74AB35C8) #x00018CAC : (subprimitive ret1valn)
### attaching GDB
(gdb) info threads
…
### (0x498 is 1176)
16 LWP 1176 "Background proc" 0x76df59a4 in read () from /lib/libc.so.6
…
(gdb) thread 16
[Switching to thread 16 (LWP 1176)]
#0 0x76df59a4 in read () from /lib/libc.so.6
(gdb) bt
#0 0x76df59a4 in read () from /lib/libc.so.6
#1 0x76d9df18 in _IO_file_underflow () from /lib/libc.so.6
#2 0x76d9f068 in _IO_default_uflow () from /lib/libc.so.6
#3 0x76d991bc in getc () from /lib/libc.so.6
#4 0x0002a830 in readc () at ../lisp-debug.c:243
#5 0x0002afc8 in lisp_Debugger (xp=xp@entry=0x74ab3268,
info=info@entry=0x74ab31e8, why=why@entry=4,
in_foreign_code=in_foreign_code@entry=0,
message=0x74ab2fc8 "Unhandled exception 4 at 0x1a068, context->regs at #x74ab3288", message@entry=0x74ab2fc0 "h2\253t") at ../lisp-debug.c:1558
#6 0x00026d5c in signal_handler (signum=4, info=0x74ab31e8,
context=0x74ab3268) at ../arm-exceptions.c:1461
#7 <signal handler called>
#8 _SPmisc_alloc () at ../arm-spentry.s:2029
#9 0x0001a37c in _SPmisc_alloc_init () at ../arm-spentry.s:2267
#10 0x10071bf4 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0 0x0 0
r1 0x765127f0 1985030128
r2 0x400 1024
r3 0x0 0
r4 0x400 1024
r5 0x765127f0 1985030128
r6 0x0 0
r7 0x3 3
r8 0x2f09c 192668
r9 0x40898 264344
r10 0x0 0
r11 0x4095c 264540
r12 0x0 0
sp 0x74ab2f08 0x74ab2f08
lr 0x0 0
pc 0x76df59a4 0x76df59a4 <read+96>
cpsr 0x800f0010 -2146500592
(gdb) frame 6
#6 0x00026d5c in signal_handler (signum=4, info=0x74ab31e8,
context=0x74ab3268) at ../arm-exceptions.c:1461
warning: Source file is more recent than executable.
1461 if (lisp_Debugger(context, info, signum, (old_valence != TCR_STATE_LISP), msg)) {
(gdb) info frame
Stack level 6, frame at 0x74ab31e8:
pc = 0x26d5c in signal_handler (../arm-exceptions.c:1461);
saved pc = 0x76d62d30
called by frame at 0x74ab31e8, caller of frame at 0x74ab2fb8
source language c.
Arglist at 0x74ab2fb8, args: signum=4, info=0x74ab31e8, context=0x74ab3268
Locals at 0x74ab2fb8, Previous frame's sp is 0x74ab31e8
Saved registers:
r4 at 0x74ab31c8, r5 at 0x74ab31cc, r6 at 0x74ab31d0, r7 at 0x74ab31d4,
r8 at 0x74ab31d8, r9 at 0x74ab31dc, r10 at 0x74ab31e0, lr at 0x74ab31e4
(gdb) frame 8
#8 _SPmisc_alloc () at ../arm-spentry.s:2029
warning: Source file is more recent than executable.
2029 __(Misc_Alloc(arg_z,imm0,imm2))
(gdb) info frame
Stack level 8, frame at 0x74ab3558:
pc = 0x1a068 in _SPmisc_alloc (../arm-spentry.s:2029); saved pc = 0x1a37c
called by frame at 0x74ab3568, caller of frame at 0x74ab31e8
source language asm.
Arglist at 0x74ab3558, args:
Locals at 0x74ab3558, Previous frame's sp is 0x74ab3558
(gdb) x/i 0x1a068
0x1a068 <_SPmisc_alloc+120>: sub r12, r12, r2
(gdb) x/i 0x1a37c
0x1a37c <_SPmisc_alloc_init+28>: pop {r0, r10, r11, lr}
(gdb) frame 9
#9 0x0001a37c in _SPmisc_alloc_init () at ../arm-spentry.s:2267
2267 __(bl _SPmisc_alloc)
(gdb) info frame
Stack level 9, frame at 0x74ab3568:
pc = 0x1a37c in _SPmisc_alloc_init (../arm-spentry.s:2267);
saved pc = 0x10071bf4
called by frame at 0x74ab3568, caller of frame at 0x74ab3558
source language asm.
Arglist at 0x74ab3558, args:
Locals at 0x74ab3558, Previous frame's sp is 0x74ab3568
Saved registers:
r0 at 0x74ab3558, r10 at 0x74ab355c, r11 at 0x74ab3560, lr at 0x74ab3564
(gdb) x/i 0x10071bf4
0x10071bf4: str r4, [r10, #-4]!
(gdb) frame 10
#10 0x10071bf4 in ?? ()
(gdb) info frame
Stack level 10, frame at 0x74ab3568:
pc = 0x10071bf4; saved pc = <not saved>
Outermost frame: previous frame identical to this frame (corrupt stack?)
caller of frame at 0x74ab3568
Arglist at 0x74ab3568, args:
Locals at 0x74ab3568, Previous frame's sp is 0x74ab3568
(gdb) disas 0x1a37c
Dump of assembler code for function _SPmisc_alloc_init:
0x0001a360 <+0>: mov r0, #19
0x0001a364 <+4>: push {r0, r10, r11, lr}
0x0001a368 <+8>: mov r11, #0
0x0001a36c <+12>: mov r9, r4
0x0001a370 <+16>: mov r4, r5
0x0001a374 <+20>: mov r5, r6
0x0001a378 <+24>: bl 0x19ff0 <_SPmisc_alloc>
0x0001a37c <+28>: pop {r0, r10, r11, lr}
0x0001a380 <+32>: mov r5, r9
0x0001a384 <+36>: mov r0, #205 ; 0xcd
0x0001a388 <+40>: orr r0, r0, #512 ; 0x200
0x0001a38c <+44>: add r8, r0, #67108865 ; 0x4000001
0x0001a390 <+48>: mov r2, #8
0x0001a394 <+52>: ldr r9, [r8, #6]
0x0001a398 <+56>: ldr pc, [r9, #-2]
End of assembler dump.
(gdb) disas 0x1a068
Dump of assembler code for function _SPmisc_alloc:
0x00019ff0 <+0>: tst r5, #-67108861 ; 0xfc000003
0x00019ff4 <+4>: bne 0x1a08c <_SPmisc_alloc+156>
0x00019ff8 <+8>: asr r0, r4, #2
0x00019ffc <+12>: orr r0, r0, r5, lsl #6
0x0001a000 <+16>: and r1, r0, #7
0x0001a004 <+20>: cmp r1, #2
0x0001a008 <+24>: mov r2, r5
0x0001a00c <+28>: beq 0x1a05c <_SPmisc_alloc+108>
0x0001a010 <+32>: asr r1, r4, #2
0x0001a014 <+36>: cmp r1, #191 ; 0xbf
0x0001a018 <+40>: ble 0x1a05c <_SPmisc_alloc+108>
0x0001a01c <+44>: lsr r2, r5, #2
0x0001a020 <+48>: cmp r1, #207 ; 0xcf
0x0001a024 <+52>: ble 0x1a05c <_SPmisc_alloc+108>
0x0001a028 <+56>: lsr r2, r5, #1
0x0001a02c <+60>: cmp r1, #223 ; 0xdf
0x0001a030 <+64>: ble 0x1a05c <_SPmisc_alloc+108>
0x0001a034 <+68>: cmp r1, #247 ; 0xf7
0x0001a038 <+72>: lsleq r2, r5, #2
0x0001a03c <+76>: addeq r2, r2, #4
0x0001a040 <+80>: beq 0x1a05c <_SPmisc_alloc+108>
0x0001a044 <+84>: cmp r1, #255 ; 0xff
0x0001a048 <+88>: addeq r2, r5, #28
0x0001a04c <+92>: lsreq r2, r2, #5
0x0001a050 <+96>: beq 0x1a05c <_SPmisc_alloc+108>
0x0001a054 <+100>: lsl r2, r5, #1
0x0001a058 <+104>: add r2, r2, #4
0x0001a05c <+108>: add r2, r2, #11
0x0001a060 <+112>: bic r2, r2, #7
0x0001a064 <+116>: sub r2, r2, #6
0x0001a068 <+120>: sub r12, r12, r2
0x0001a06c <+124>: ldr r7, [r3, #104] ; 0x68
0x0001a070 <+128>: cmp r12, r7
0x0001a074 <+132>: bhi 0x1a07c <_SPmisc_alloc+140>
0x0001a078 <+136>: udf #0
0x0001a07c <+140>: str r0, [r12, #-6]
0x0001a080 <+144>: mov r4, r12
0x0001a084 <+148>: bic r12, r12, #7
0x0001a088 <+152>: bx lr
0x0001a08c <+156>: mov r6, #308 ; 0x134
0x0001a090 <+160>: mov r2, #12
0x0001a094 <+164>: b 0x198a4 <_SPksignalerr>
(gdb) disas 0x10071bd4,+0x30
Dump of assembler code from 0x10071bd4 to 0x10071c04:
0x10071bd4: b 0x10071be0
0x10071bd8: ldr lr, [r3, #260] ; 0x104
0x10071bdc: blx lr
0x10071be0: mov r6, r4
0x10071be4: mov r4, #51 ; 0x33
0x10071be8: mov r5, #296 ; 0x128
0x10071bec: ldr lr, [r3, #652] ; 0x28c
0x10071bf0: blx lr
0x10071bf4: str r4, [r10, #-4]!
0x10071bf8: mov r5, r4
0x10071bfc: ldr r4, [r10, #8]
0x10071c00: mov r2, #8