mps
mps copied to clipboard
Rare failure in amcssth in hot variety on Linux
The failure is very rare on x64-64, less than one in 100 runs. Here are 1000 runs of lii6gc/hot/amcssth, with 5 failures:
$ for _ in $(seq 1000); do lii6gc/hot/amcssth > /dev/null; done
amcssth: fmtdy.c:115: dylan_wrapper_check: Assertion `(w[WW] & 3) == 0' failed.
Aborted (core dumped)
amcssth: fmtdy.c:729: dylan_fwd: Assertion `(*(mps_word_t *)old & 3) == 0' failed.
Aborted (core dumped)
amcssth: fmtdy.c:117: dylan_wrapper_check: Assertion `ww[WW] == w[WW]' failed.
Aborted (core dumped)
amcssth: fmtdy.c:117: dylan_wrapper_check: Assertion `ww[WW] == w[WW]' failed.
Aborted (core dumped)
amcssth: fmtdy.c:117: dylan_wrapper_check: Assertion `ww[WW] == w[WW]' failed.
Aborted (core dumped)
I've also seen the assertion failure here:
amcssth: fmtdy.c:104: dylan_wrapper_check: Assertion `w != NULL' failed.
Notes about the envelope of the bug:
- Never occurs in the cool variety (not even on arm64)
- Much rarer on macOS
- Much more common on arm64
- Occurs more often if you have more cores
- Still occurs if we turn off garbage collection by parking the area.
Here's some analysis of what seems to be going on. The lack of locks in amcssth seems to be intentional: that is, the idea seems to be that even if there are races between the threads, then as long as updates of individual memory words are atomic, the program (and the MPS) will always see either the old pointer or the new pointer, and either case is valid. However, lockless programming is delicate and there are at least two ways that this could go wrong:
- You can't assume that looking up the same memory location twice in succession will yield same result. For example:
Here another thread might run after theif (exactRoots[i] != objNULL) cdie(dylan_check(exactRoots[i]), "dying root check");
exactRoots[i] != objNULL
test passes, settingexactRoots[i] = NULL
. The code must be rewritten to check the same object that was read:root = exactRoots[i]; if (root != objNULL) cdie(dylan_check(root), "dying root check");
- You can't assume that if one memory location was been updated before another in one thread, that the same order of updates will be visible in another thread. For example, if thread A executes:
Then in thread A the new object is initialized inexactRoots[i] = make(ap, roots_count);
make
beforeexactRoots[i]
is updated. But in the absence of memory barriers, it is possible for the updates to become visible in thread B in the other order: that is,exactRoots[i]
is updated before the new object is initialized and so thread B sees an uninitialized object inexactRoots[i]
.
exactRoots[i] = make(ap, roots_count);
This could be replaced with:
ambiguousRoot = make(ap, roots_count);
MEMORY_BARRIER();
exactRoots[i] = ambiguousRoot;
The MPS is fine with an ambiguous root pointing at anything, and will pin down even a partially initialized memory block. The barrier then ensures that the exact roots point to objects that have been initialised for every thread. C local variables are ambiguous roots because the stack is registered. I'm just making it clear.
Alternatively, we could put a barrier at the end of make
but that obscures the issue and is irrelevant when the new object is only assigned to an ambiguous reference.
@gareth-rees has attempted this in https://github.com/Ravenbrook/mps/commit/bc5f66c0306f853c736200e0dc7248f342b5ba17 but reports that there are still problems.
Note that adding a memory barrier after make
is not sufficient to avoid a memory ordering issue—you need a memory barrier on the read end of the data transfer as well as on the write end. Hence we also need to turn code like this:
if (exactRoots[i] != objNULL)
dylan_check(exactRoots[i]);
into
ambiguousRoot = exactRoots[i];
if (ambiguousRoot != objNULL) {
MEMORY_BARRIER();
dylan_check(ambiguousRoot);
}
to avoid getting the update to exactRoots[i]
before the update to the object.
Reproduction notes:
- Launch EC2 instance with Ubuntu on arm64, ami-0bbd1b7c1501b5858, on t4g.2xlarge.
-
sudo apt-get update
-
sudo apt-get install clang make gdb
-
git clone https://github.com/Ravenbrook/mps.git
-
cd mps
-
git switch branch/2021-01-16/arm64
-
cd code
-
make -f lia6ll.gmk VARIETY=hot amcssth
-
gdb lia6ll/hot/amcssth
-
handle SIGXFSZ nostop noprint
-
handle SIGXCPU nostop noprint
-
handle SIGSEGV nostop noprint
-
run
repeatedly until you get an abort (about 1 in 10 runs for me at 47d2571190e79615614bed370e87f36eefcb4b4d )
A .gdbinit file can be set up containing the handle commands.
Also in .gdbinit, a definition like this can be used to run the test repeatedly until it exits on a signal:
define rabbit
while 1
run > /dev/null
if $_siginfo
loop_break
end
end
end
Combining commands, you can get a quick recompile-and-go command line like this:
make -f lia6ll.gmk VARIETY=hot amcssth && gdb -ex rabbit lia6ll/hot/amcssth
Created branch/2021-01-25/amcssth-races to work on this.
I found it helpful to increase collectionsCOUNT
to several hundred at https://github.com/Ravenbrook/mps/blob/47d2571190e79615614bed370e87f36eefcb4b4d/code/amcssth.c#L32 and having done this and fixed up the two obvious races mentioned in https://github.com/Ravenbrook/mps/issues/59#issue-792585284 in https://github.com/Ravenbrook/mps/commit/199bdac2c81591f438b9834874becb73c2f62e36 I was able to get this interesting backtrace before I ran out of time today:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000fffff7da5d54 in __GI_abort () at abort.c:79
#2 0x0000fffff7db258c in __assert_fail_base (fmt=0xfffff7eaa158 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x498709 "ww[WW] == w[WW]", file=file@entry=0x4986a6 "fmtdy.c", line=line@entry=117,
function=function@entry=0x4986ae "int dylan_wrapper_check(mps_word_t *)") at assert.c:92
#3 0x0000fffff7db25f4 in __GI___assert_fail (assertion=0x498709 "ww[WW] == w[WW]", file=0x4986a6 "fmtdy.c", line=117,
function=0x4986ae "int dylan_wrapper_check(mps_word_t *)") at assert.c:101
#4 0x0000000000402764 in dylan_wrapper_check (w=<optimized out>) at fmtdy.c:117
#5 0x0000000000403214 in dylan_skip (object=0xfffff4217ad8) at fmtdy.c:668
#6 0x000000000046fca8 in amcSegFix (seg=0xfffff43c1078, ss=0xfffff7485560, refIO=0xfffff7485408) at ./poolamc.c:1600
#7 0x0000000000436ecc in _mps_fix2 (mps_ss=<optimized out>, mps_ref_io=0xfffff41ebda0) at ./trace.c:1389
#8 0x0000000000402d2c in dylan_scan_contig (mps_ss=0xfffff7485568, base=<optimized out>, limit=0xfffff41ebdc0) at fmtdy.c:227
#9 dylan_scan1 (mps_ss=0xfffff7485568, object_io=0xfffff74854c8) at fmtdy.c:478
#10 0x00000000004033c8 in dylan_scan (mps_ss=0xfffff7485568, base=0xfffff41ebd90, limit=0xfffff41ebdf0) at fmtdy.c:527
#11 0x000000000046f824 in FormatScan (format=0xfffff7cc0270, ss=0xfffff7485560, base=0xfffff41ebcf8, limit=<optimized out>)
at ./format.c:220
#12 amcSegScan (totalReturn=0xfffff74855bc, seg=0xfffff43c3b40, ss=0xfffff7485560) at ./poolamc.c:1428
#13 0x000000000048f39c in traceScanSegRes (ts=1, rank=1, arena=0xfffff7ffa000, seg=0xfffff43c3b40) at ./trace.c:1148
#14 0x0000000000427478 in traceScanSeg (ts=1, rank=1, arena=0xfffff7ffa000, seg=0xfffff43c3b40) at ./trace.c:1223
#15 TraceAdvance (trace=0xfffff7ffaaa8) at ./trace.c:1664
#16 0x0000000000425b84 in TracePoll (workReturn=0xfffff74856f0, collectWorldReturn=<optimized out>, globals=<optimized out>,
collectWorldAllowed=<optimized out>) at ./trace.c:1785
#17 0x000000000040abb0 in ArenaPoll (globals=0xfffff7ffa008) at ./global.c:745
#18 0x000000000040bea0 in mps_ap_fill (p_o=0xfffff74858f8, mps_ap=0xfffff7cc14c8, size=16) at ./mpsi.c:1065
#19 0x0000000000402520 in make (ap=0xfffff7cc14c8, roots_count=180) at amcssth.c:66
#20 0x00000000004023cc in churn (ap=0xfffff7cc14c8, roots_count=180) at amcssth.c:114
#21 0x0000000000402308 in kid_thread (arg=0xfffffffff530) at amcssth.c:144
#22 0x0000fffff7ef93c8 in start_thread (arg=0xfffffffff44f) at pthread_create.c:477
#23 0x0000fffff7e54f0c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
A couple of notes:
-
See commit bc5f66c for more races that need fixing -- basically every time amcssth.c reads a root from
exactRoots[i]
and then does something to the object pointed to by the root, then you need a barrier. -
There are races in
dylan_make_wrappers
:if(ww == NULL) { ww = malloc(sizeof(mps_word_t) * (BASIC_WRAPPER_SIZE + 1));
This could go wrong in two ways: there could be two threads that both pass the
ww == NULL
test and start making wrappers; or there could be a thread that reads the wrapper before it is initialized. (The easiest way to fix this is to calldylan_make_wrappers
from the main thread, before starting the child threads.) -
I used
__sync_synchronize
(which implements a full memory barrier) for simplicity, but for more fine-grained concurrency, we should use the Built-in Functions for Memory Model Aware Atomic Operations -- for reading a root we would use:__atomic_load(&exactRoots[i], &ambiguousRoot, __ATOMIC_ACQUIRE);
and for writing a root we would use:
__atomic_store(&exactRoots[i], &ambiguousRoot, __ATOMIC_RELEASE);
I think this gives the maximum safe amount of parallelism. But we shouldn't relax the memory order constraints until we've fixed all the races with full barriers.
See commit bbd9ce9ea for a version of the code with __atomic_load
and __atomic_store
instead of __sync_synchronize
(but using __ATOMIC_SEQ_CST
for full sequential consistency in each case, instead of the more relaxed ordering in my note above). This greatly reduces the frequency of failures, but lia6ll/hot/amcssth still fails once in several hundred runs. So there is another race or memory ordering issue that we missed.
2. There are races in `dylan_make_wrappers`
That looks like the sort of thing that would, in general, require a real lock. Simply ensuring it's called before we spawn threads is OK for our tests for now.
So there is another race or memory ordering issue that we missed.
Before I packed up yesterday I was looking at this one: https://github.com/Ravenbrook/mps/blob/bbd9ce9ea8ce5d1dc1093f8e6cd2f56dc91c2763/code/fmtdytst.c#L162 and no doubt there are similar races throughout fmtdytst.c to sort out.
A deeper question is whether fmtdy.c is safe, and by extension whether client formats are safe. It's somewhat theoretical because the way the MPS gets exclusive access to memory is likely to cause synchronisation, but we do not say so. In theory a format scan could see an object with out-of-order updates.
Just a note for anyone reading this thread later. @gareth-rees says above:
* Much more common on arm64
The reason we're running in to this now is because the ARM has relaxed memory consistency model (described very niceley here https://www.cs.utexas.edu/~bornholt/post/memory-models.html ) and we've been in an Intel monoculture for quite a few years. The recent work to port to arm64 has shaken out some sloppy work in developing this test. There hasn't been an easily-available platform to work on this problem until relatively recently.
We do not currently believe the MPS itself has any problems.
However, it's likely that client code using lockless concurrent updates to memory and roots managed by the MPS will need to be more careful, as demonstrated by the failures in this test.
Pull request #83 fixes all of the races that we have identified so far. I was able (after many attempts) to capture one of the unidentified races in LLDB. The backtrace in the asserting thread looks very similar to the one captured by rptb1 above:
* thread #12, stop reason = hit program assert
frame #0: 0x00000001a13121b0 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x00000001a1348cec libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x00000001a12822c8 libsystem_c.dylib`abort + 180
frame #3: 0x00000001a1281620 libsystem_c.dylib`__assert_rtn + 272
* frame #4: 0x0000000100098c10 amcssth`dylan_wrapper_check.cold.21 at fmtdy.c:117:3 [opt]
frame #5: 0x00000001000040f4 amcssth`dylan_wrapper_check(w=<unavailable>) at fmtdy.c:117:3 [opt]
frame #6: 0x000000010000474c amcssth`dylan_skip(object=0x00000001003998e8) at fmtdy.c:668:3 [opt]
frame #7: 0x00000001000709d0 amcssth`amcSegFix(seg=0x0000000100385698, ss=0x00000001703feb98, refIO=0x00000001703fea18) at poolamc.c:1600:15 [opt]
frame #8: 0x000000010003887c amcssth`_mps_fix2(mps_ss=<unavailable>, mps_ref_io=0x0000000100353520) at trace.c:1432:9 [opt]
frame #9: 0x0000000100004454 amcssth`dylan_scan1 at fmtdy.c:227:17 [opt]
frame #10: 0x0000000100004414 amcssth`dylan_scan1(mps_ss=0x00000001703feba0, object_io=<unavailable>) at fmtdy.c:478:13 [opt]
frame #11: 0x00000001000048bc amcssth`dylan_scan(mps_ss=0x00000001703feba0, base=0x0000000100353510, limit=0x0000000100353700) at fmtdy.c:527:11 [opt]
frame #12: 0x000000010007046c amcssth`amcSegScan [inlined] TraceScanFormat(ss=0x00000001703feb98, base=0x0000000100353510, limit=<unavailable>) at trace.c:1538:10 [opt]
frame #13: 0x0000000100070448 amcssth`amcSegScan(totalReturn=0x00000001703fec0c, seg=0x0000000100385870, ss=0x00000001703feb98) at poolamc.c:1428:11 [opt]
frame #14: 0x000000010008f7f8 amcssth`traceScanSegRes(ts=1, rank=<unavailable>, arena=0x0000000100138000, seg=0x0000000100385870) at trace.c:1204:11 [opt]
frame #15: 0x0000000100028714 amcssth`TraceAdvance [inlined] traceScanSeg(ts=1, rank=1, arena=0x0000000100138000, seg=0x0000000100385870) at trace.c:1266:9 [opt]
frame #16: 0x0000000100028700 amcssth`TraceAdvance(trace=0x0000000100138aa8) at trace.c:1727:13 [opt]
frame #17: 0x0000000100026e8c amcssth`TracePoll(workReturn=0x00000001703fed50, collectWorldReturn=0x00000001703fed5c, globals=0x0000000100138008, collectWorldAllowed=<unavailable>) at trace.c:1848:5 [opt]
frame #18: 0x000000010000be74 amcssth`ArenaPoll(globals=0x0000000100138008) at global.c:745:16 [opt]
frame #19: 0x000000010000d084 amcssth`mps_ap_fill(p_o=0x00000001703feee8, mps_ap=0x00000001003823c0, size=56) at mpsi.c:1065:5 [opt]
frame #20: 0x0000000100003f2c amcssth`make(ap=0x00000001003823c0, roots_count=180) at amcssth.c:66:5 [opt]
frame #21: 0x0000000100003e60 amcssth`churn(ap=0x00000001003823c0, roots_count=180) at amcssth.c:121:12 [opt]
frame #22: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #23: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
Here are the backtraces in the other threads
(lldb) bt all
bt all
thread #1, queue = 'com.apple.main-thread'
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d04c amcssth`mps_ap_fill [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d044 amcssth`mps_ap_fill(p_o=0x000000016fdfeed8, mps_ap=0x0000000100381e28, size=32) at mpsi.c:1062:3 [opt]
frame #7: 0x0000000100003f2c amcssth`make(ap=0x0000000100381e28, roots_count=180) at amcssth.c:66:5 [opt]
frame #8: 0x0000000100003dd8 amcssth`churn(ap=0x0000000100381e28, roots_count=180) at amcssth.c:129:43 [opt]
frame #9: 0x0000000100003a7c amcssth`test_pool(name=<unavailable>, pool=0x00000001003805c8, roots_count=180) at amcssth.c:279:5 [opt]
frame #10: 0x000000010000344c amcssth`main [inlined] test_arena at amcssth.c:345:3 [opt]
frame #11: 0x00000001000031e4 amcssth`main(argc=<unavailable>, argv=0x000000016fdff680) at amcssth.c:363:3 [opt]
frame #12: 0x00000001a101fe50 dyld`start + 2544
thread #2
frame #0: 0x00000001a1309d70 libsystem_kernel.dylib`mach_msg2_trap + 8
frame #1: 0x00000001a131b8a4 libsystem_kernel.dylib`mach_msg2_internal + 80
frame #2: 0x00000001a13125c4 libsystem_kernel.dylib`mach_msg_overwrite + 540
frame #3: 0x00000001a130a0ec libsystem_kernel.dylib`mach_msg + 24
frame #4: 0x00000001000985cc amcssth`protCatchThread [inlined] protCatchOne at protxc.c:207:8 [opt]
frame #5: 0x000000010009859c amcssth`protCatchThread(p=<unavailable>) at protxc.c:284:5 [opt]
frame #6: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #3
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d04c amcssth`mps_ap_fill [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d044 amcssth`mps_ap_fill(p_o=0x000000016ff12ee8, mps_ap=0x0000000100381538, size=56) at mpsi.c:1062:3 [opt]
frame #7: 0x0000000100003f2c amcssth`make(ap=0x0000000100381538, roots_count=180) at amcssth.c:66:5 [opt]
frame #8: 0x0000000100003dd8 amcssth`churn(ap=0x0000000100381538, roots_count=180) at amcssth.c:129:43 [opt]
frame #9: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #10: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #4
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d04c amcssth`mps_ap_fill [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d044 amcssth`mps_ap_fill(p_o=0x000000016ff9eee8, mps_ap=0x0000000100382518, size=48) at mpsi.c:1062:3 [opt]
frame #7: 0x0000000100003f2c amcssth`make(ap=0x0000000100382518, roots_count=180) at amcssth.c:66:5 [opt]
frame #8: 0x0000000100003dd8 amcssth`churn(ap=0x0000000100382518, roots_count=180) at amcssth.c:129:43 [opt]
frame #9: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #10: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #5
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d04c amcssth`mps_ap_fill [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d044 amcssth`mps_ap_fill(p_o=0x000000017002aee8, mps_ap=0x00000001003827c8, size=40) at mpsi.c:1062:3 [opt]
frame #7: 0x0000000100003f2c amcssth`make(ap=0x00000001003827c8, roots_count=180) at amcssth.c:66:5 [opt]
frame #8: 0x0000000100003e60 amcssth`churn(ap=0x00000001003827c8, roots_count=180) at amcssth.c:121:12 [opt]
frame #9: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #10: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #6
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d04c amcssth`mps_ap_fill [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d044 amcssth`mps_ap_fill(p_o=0x00000001700b6ee8, mps_ap=0x0000000100381a60, size=32) at mpsi.c:1062:3 [opt]
frame #7: 0x0000000100003f2c amcssth`make(ap=0x0000000100381a60, roots_count=180) at amcssth.c:66:5 [opt]
frame #8: 0x0000000100003e60 amcssth`churn(ap=0x0000000100381a60, roots_count=180) at amcssth.c:121:12 [opt]
frame #9: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #10: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #7
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d04c amcssth`mps_ap_fill [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d044 amcssth`mps_ap_fill(p_o=0x0000000170142ee8, mps_ap=0x0000000100381c50, size=40) at mpsi.c:1062:3 [opt]
frame #7: 0x0000000100003f2c amcssth`make(ap=0x0000000100381c50, roots_count=180) at amcssth.c:66:5 [opt]
frame #8: 0x0000000100003e60 amcssth`churn(ap=0x0000000100381c50, roots_count=180) at amcssth.c:121:12 [opt]
frame #9: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #10: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #8
frame #0: 0x00000001000050cc amcssth`rnd at testlib.c:83:32 [opt]
frame #1: 0x0000000100004ca8 amcssth`dylan_init(addr=<unavailable>, size=56, refs=0x00000001000ac648, nr_refs=180) at fmtdytst.c:102:22 [opt]
frame #2: 0x0000000100003f5c amcssth`make(ap=0x00000001003813f8, roots_count=180) at amcssth.c:69:11 [opt]
frame #3: 0x0000000100003e60 amcssth`churn(ap=0x00000001003813f8, roots_count=180) at amcssth.c:121:12 [opt]
frame #4: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #5: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #9
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d444 amcssth`mps_ap_trip [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d438 amcssth`mps_ap_trip(mps_ap=0x0000000100381908, p=0x0000000100355b50, size=48) at mpsi.c:1100:3 [opt]
frame #7: 0x0000000100003f84 amcssth`make(ap=0x0000000100381908, roots_count=180) at amcssth.c:72:12 [opt]
frame #8: 0x0000000100003e60 amcssth`churn(ap=0x0000000100381908, roots_count=180) at amcssth.c:121:12 [opt]
frame #9: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #10: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #10
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d04c amcssth`mps_ap_fill [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d044 amcssth`mps_ap_fill(p_o=0x00000001702e6ee8, mps_ap=0x0000000100383dc8, size=48) at mpsi.c:1062:3 [opt]
frame #7: 0x0000000100003f2c amcssth`make(ap=0x0000000100383dc8, roots_count=180) at amcssth.c:66:5 [opt]
frame #8: 0x0000000100003dd8 amcssth`churn(ap=0x0000000100383dc8, roots_count=180) at amcssth.c:129:43 [opt]
frame #9: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #10: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
thread #11
frame #0: 0x00000001a130ca1c libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001a1346144 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001a1343a9c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x0000000100025378 amcssth`ArenaEnterLock [inlined] LockClaim(lock=0x0000000100380110) at lockix.c:124:9 [opt]
frame #4: 0x0000000100025338 amcssth`ArenaEnterLock(arena=0x0000000100138000, recursive=0) at global.c:576:5 [opt]
frame #5: 0x000000010000d04c amcssth`mps_ap_fill [inlined] ArenaEnter(arena=0x0000000100138000) at global.c:553:3 [opt]
frame #6: 0x000000010000d044 amcssth`mps_ap_fill(p_o=0x0000000170372ee8, mps_ap=0x0000000100381f70, size=16) at mpsi.c:1062:3 [opt]
frame #7: 0x0000000100003f2c amcssth`make(ap=0x0000000100381f70, roots_count=180) at amcssth.c:66:5 [opt]
frame #8: 0x0000000100003dd8 amcssth`churn(ap=0x0000000100381f70, roots_count=180) at amcssth.c:129:43 [opt]
frame #9: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #10: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
* thread #12, stop reason = hit program assert
frame #0: 0x00000001a13121b0 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x00000001a1348cec libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x00000001a12822c8 libsystem_c.dylib`abort + 180
frame #3: 0x00000001a1281620 libsystem_c.dylib`__assert_rtn + 272
* frame #4: 0x0000000100098c10 amcssth`dylan_wrapper_check.cold.21 at fmtdy.c:117:3 [opt]
frame #5: 0x00000001000040f4 amcssth`dylan_wrapper_check(w=<unavailable>) at fmtdy.c:117:3 [opt]
frame #6: 0x000000010000474c amcssth`dylan_skip(object=0x00000001003998e8) at fmtdy.c:668:3 [opt]
frame #7: 0x00000001000709d0 amcssth`amcSegFix(seg=0x0000000100385698, ss=0x00000001703feb98, refIO=0x00000001703fea18) at poolamc.c:1600:15 [opt]
frame #8: 0x000000010003887c amcssth`_mps_fix2(mps_ss=<unavailable>, mps_ref_io=0x0000000100353520) at trace.c:1432:9 [opt]
frame #9: 0x0000000100004454 amcssth`dylan_scan1 at fmtdy.c:227:17 [opt]
frame #10: 0x0000000100004414 amcssth`dylan_scan1(mps_ss=0x00000001703feba0, object_io=<unavailable>) at fmtdy.c:478:13 [opt]
frame #11: 0x00000001000048bc amcssth`dylan_scan(mps_ss=0x00000001703feba0, base=0x0000000100353510, limit=0x0000000100353700) at fmtdy.c:527:11 [opt]
frame #12: 0x000000010007046c amcssth`amcSegScan [inlined] TraceScanFormat(ss=0x00000001703feb98, base=0x0000000100353510, limit=<unavailable>) at trace.c:1538:10 [opt]
frame #13: 0x0000000100070448 amcssth`amcSegScan(totalReturn=0x00000001703fec0c, seg=0x0000000100385870, ss=0x00000001703feb98) at poolamc.c:1428:11 [opt]
frame #14: 0x000000010008f7f8 amcssth`traceScanSegRes(ts=1, rank=<unavailable>, arena=0x0000000100138000, seg=0x0000000100385870) at trace.c:1204:11 [opt]
frame #15: 0x0000000100028714 amcssth`TraceAdvance [inlined] traceScanSeg(ts=1, rank=1, arena=0x0000000100138000, seg=0x0000000100385870) at trace.c:1266:9 [opt]
frame #16: 0x0000000100028700 amcssth`TraceAdvance(trace=0x0000000100138aa8) at trace.c:1727:13 [opt]
frame #17: 0x0000000100026e8c amcssth`TracePoll(workReturn=0x00000001703fed50, collectWorldReturn=0x00000001703fed5c, globals=0x0000000100138008, collectWorldAllowed=<unavailable>) at trace.c:1848:5 [opt]
frame #18: 0x000000010000be74 amcssth`ArenaPoll(globals=0x0000000100138008) at global.c:745:16 [opt]
frame #19: 0x000000010000d084 amcssth`mps_ap_fill(p_o=0x00000001703feee8, mps_ap=0x00000001003823c0, size=56) at mpsi.c:1065:5 [opt]
frame #20: 0x0000000100003f2c amcssth`make(ap=0x00000001003823c0, roots_count=180) at amcssth.c:66:5 [opt]
frame #21: 0x0000000100003e60 amcssth`churn(ap=0x00000001003823c0, roots_count=180) at amcssth.c:121:12 [opt]
frame #22: 0x0000000100003c5c amcssth`kid_thread(arg=0x000000016fdff008) at amcssth.c:159:5 [opt]
frame #23: 0x00000001a134906c libsystem_pthread.dylib`_pthread_start + 148
So we have a Dylan object whose wrapper is incorrect. In dylan_skip
we have an object p
with wrapper w
:
(lldb) p/x w
(mps_word_t *) $1 = 0x0000000100399ac0
but this is not one of the two wrappers created by dylan_make_wrappers
:
(lldb) p/x tvw
(mps_word_t *) $2 = 0x0000600000c00060
(lldb) p/x ww
(mps_word_t *) $3 = 0x0000600001700040
Another place that I have seen a failure is in dylan_fwd
:
Assertion failed: ((*(mps_word_t *)old & 3) == 0), function dylan_fwd, file fmtdy.c, line 729.
This is very surprising since the code in dylan_scan_contig
looks like this:
r = *p++;
if(((mps_word_t)r&3) != 0) /* pointers tagged with 0 */
goto loop; /* not a pointer */
if(!MPS_FIX1(mps_ss, r)) goto loop;
res = MPS_FIX2(mps_ss, p-1);
so that only words tagged with zero get fixed.
Since we reactivated CI, I've noticed that this failure occurs frequently on Travis CI on FRI6GC and not elsewhere. For example see https://app.travis-ci.com/github/Ravenbrook/mps/jobs/594460967#L846-L857 . It could be that the particular version of GCC on that FreeBSD is provoking it. The version string is "gcc (FreeBSD Ports Collection) 10.0.0 20191222 (experimental)". That might suggest a way to debug this further.