GC bug on skyline benchmark?
This is the skyline benchmark from mpllang/parallel-ml-bench.
The bug sometimes causes a segfault, although I've also seen it hang. It also appears to occur only on small core counts.
To reproduce:
$ cd parallel-ml-bench/mpl
$ make skyline.mpl.bin
$ bin/skyline.mpl.bin @mpl procs 4 -- -repeat 20
The bug is still present as of the current commit, b69ca194fc22ade32b571393f0433de77ab48ebb
I think the bug is somewhere in CC. (If I disable forkGC in the scheduler, then the bug appears to go away. Or, at least, I haven't been able to trigger it after making that change.)
Some notes:
- Using @mpl max-cc-depth 1 -- (see PR #163) seems to make the bug go away. This limits CGC, making it very shallow, in particular by only allowing CGC on the root heap.
- In gdb, I was able to trigger assertion failures on debug version using a smaller size (-size 100000).
- The bug appears to be a dangling pointer originating from within the work-stealing deque.
  - Perhaps work-stealing deque is not properly tracked by remembered sets in CGC?
    - On shwestrick/mpl/gc-debug, I tried additionally snapshotting the contents of the deque when CGC is spawned. But this didn't seem to change anything.
  - Perhaps LGC is forwarding an object reachable from deque but not updating the down-pointer? (Then later CGC takes over, and witnesses a dangling down-pointer.)
    - TODO: double check that LGC handles objects reachable from deque properly.
Some possible progress on this.
I've discovered a race between LGC and scheduler steals. The problem is in the implementation of the ABP concurrent deque: on a steal, the stolen value is read optimistically, before the CAS that confirms the steal. Between the optimistic read and the CAS, a concurrent LGC could relocate the object, leaving the thief with a stale pointer even though its CAS succeeds. To fix this, I think all down-pointers from the work-stealing deque need to be pinned.
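To make the window concrete, here is a toy single-threaded model of the interleaving. It is Python, not MPL runtime code, and every name in it (Heap, Deque, local_gc, and so on) is hypothetical. It assumes that LGC treats deque slots as roots and rewrites them in place when it moves an object, and that the steal's CAS is on the deque's top index rather than on the slot contents. Under those assumptions the CAS cannot detect the relocation.

```python
# Toy model of the steal/LGC race. All names are hypothetical and are not
# taken from the MPL runtime.

class Heap:
    """Maps addresses to objects. Relocating an object frees its old address."""
    def __init__(self):
        self.cells = {}
        self.next_addr = 100

    def alloc(self, value):
        addr = self.next_addr
        self.next_addr += 1
        self.cells[addr] = value
        return addr

    def relocate(self, addr):
        # Copy the object to a fresh address and drop the old copy.
        return self.alloc(self.cells.pop(addr))

    def is_live(self, addr):
        return addr in self.cells


class Deque:
    def __init__(self):
        self.slots = []
        self.top = 0  # index of the next task a thief would take

    def push(self, addr):
        self.slots.append(addr)

    def cas_top(self, expected, new):
        # Stand-in for the atomic compare-and-swap on the top index.
        if self.top == expected:
            self.top = new
            return True
        return False


def local_gc(heap, deque, pinned):
    """Move every unpinned object referenced from the deque; fix up the slots."""
    for i in range(deque.top, len(deque.slots)):
        addr = deque.slots[i]
        if addr not in pinned:
            deque.slots[i] = heap.relocate(addr)


def steal_with_interleaved_gc(heap, deque, pinned):
    t = deque.top
    stolen = deque.slots[t]        # 1. optimistic read of the slot
    local_gc(heap, deque, pinned)  # 2. victim's LGC runs inside the window
    if deque.cas_top(t, t + 1):    # 3. CAS on the index still succeeds
        return stolen              #    but `stolen` may now be stale
    return None


def run(pin):
    """Return True if the pointer the thief ends up with is still valid."""
    heap = Heap()
    deque = Deque()
    task = heap.alloc("task")
    deque.push(task)
    pinned = {task} if pin else set()
    stolen = steal_with_interleaved_gc(heap, deque, pinned)
    return heap.is_live(stolen)
```

In this model, run(False) hands the thief a dangling address, while run(True) does not, which is the motivation for pinning deque down-pointers.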
In our implementation so far, we've been handling the work-stealing deque specially. Its updates are not subjected to the standard write barrier, because this would cause all down-pointers from the deque to stay live forever.
But, an interesting point: if I subject the deque to the standard write barrier, then the bug seems to go away. (At least, I haven't been able to trigger the bug in this case yet.)
So, the interesting challenge now is to figure out how to pin deque down-pointers while also allowing these to be unpinned appropriately at a later time. Our current unpin-depth trick won't work, because the deques live in the global heap (depth 0), and after scheduler initialization, the program will never again return to depth 0.
Proposal: we could use a hybrid remembered set strategy, delimited by depth. For objects x at depth 0 and down-pointers x[i] := y, we would use remembered set entries of the form (x,i,y), enabling us to later invalidate the entry when x[i] != y. For all other objects (at depth 1 or deeper), we would continue to use unpin depths.
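A rough sketch of what the proposal might look like, again as a Python model rather than runtime code. The names (Obj, HybridRemSet, write, pinned_from_root) are made up for illustration, and the unpin-depth branch is only a placeholder for the existing scheme: it assumes an object keeps the shallowest source depth seen so far, which may not match the real implementation.

```python
# Sketch of a depth-delimited hybrid remembered set. All names are
# hypothetical; this is a model of the idea, not the MPL implementation.

class Obj:
    def __init__(self, depth, nfields=0):
        self.depth = depth
        self.fields = [None] * nfields
        self.unpin_depth = None  # only used when the source is at depth >= 1


class HybridRemSet:
    def __init__(self):
        self.entries = []  # (x, i, y) triples, only for sources x at depth 0

    def write(self, x, i, y):
        """Write barrier for x[i] := y."""
        x.fields[i] = y
        if y is None or y.depth <= x.depth:
            return  # not a down-pointer, nothing to remember
        if x.depth == 0:
            # Depth-0 source (e.g. the deque): remember the exact edge so it
            # can be invalidated once the field is overwritten.
            self.entries.append((x, i, y))
        else:
            # Deeper source: placeholder for the existing unpin-depth scheme
            # (assumption: keep the shallowest source depth).
            if y.unpin_depth is None or x.depth < y.unpin_depth:
                y.unpin_depth = x.depth

    def pinned_from_root(self):
        """Drop entries where x[i] != y, then return objects still pinned."""
        self.entries = [(x, i, y) for (x, i, y) in self.entries
                        if x.fields[i] is y]
        return [y for (_, _, y) in self.entries]


def demo():
    rs = HybridRemSet()
    deque = Obj(depth=0, nfields=1)
    task = Obj(depth=2)

    rs.write(deque, 0, task)         # push: the task is pinned via (x, i, y)
    before = rs.pinned_from_root()

    rs.write(deque, 0, None)         # pop or steal clears the slot
    after = rs.pinned_from_root()    # stale entry is invalidated

    parent = Obj(depth=1, nfields=1)
    child = Obj(depth=3)
    rs.write(parent, 0, child)       # deeper source uses the unpin depth
    return before == [task], after, child.unpin_depth
```

The point of the (x,i,y) form is that the entry carries enough information to check x[i] != y later, so a deque slot that has been popped or stolen stops keeping its old target pinned without the program ever returning to depth 0.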
This appears to be fixed! The fix is in d1646cf, which was merged as part of #180.