mpl GC bug on skyline benchmark?

This occurs on the skyline benchmark from mpllang/parallel-ml-bench.

The bug sometimes causes a segfault, although I've also seen it hang. It also appears to occur only on small core counts.

To reproduce:

$ cd parallel-ml-bench/mpl
$ make skyline.mpl.bin
$ bin/skyline.mpl.bin @mpl procs 4 -- -repeat 20

The bug is still present as of the current commit, b69ca194fc22ade32b571393f0433de77ab48ebb.

I think the bug is somewhere in CC. (If I disable forkGC in the scheduler, then the bug appears to go away. Or, at least, I haven't been able to trigger it after making that change.)

May 31 '22 14:05 shwestrick

Some notes:

- Using @mpl max-cc-depth 1 -- (see PR #163) seems to make the bug go away. This makes CGC very shallow by allowing it only on the root heap.
- In gdb, I was able to trigger assertion failures on the debug version using a smaller input size (-size 100000).
- The bug appears to be a dangling pointer originating from within the work-stealing deque.
- Perhaps the work-stealing deque is not properly tracked by remembered sets in CGC?
  - On shwestrick/mpl/gc-debug, I tried additionally snapshotting the contents of the deque when CGC is spawned, but this didn't seem to change anything.
- Perhaps LGC is forwarding an object reachable from the deque but not updating the down-pointer? (Then CGC later takes over and sees a dangling down-pointer.)
  - TODO: double-check that LGC handles objects reachable from the deque properly.

Aug 30 '22 15:08 shwestrick

Some possible progress on this.

I've discovered a race between LGC and scheduler steals. The problem is in the implementation of the ABP concurrent deque: on a steal, the stolen value is read optimistically, before the CAS that confirms the steal. Between the optimistic read and the CAS, a concurrent LGC could relocate the object, leaving the thief with a stale pointer even though its CAS succeeds. To fix this, I think all down-pointers from the work-stealing deque need to be pinned.
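To make the window concrete, here is a minimal single-threaded replay of the interleaving in C. This is only a sketch under assumptions: deque_t, steal_read, steal_commit, and relocate_slot are hypothetical names for illustration, not MPL's actual runtime code, and the "collector" is just a function that copies the object and rewrites the deque slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified ABP-style deque (not MPL's real structure). */
enum { DEQUE_CAP = 8 };

typedef struct {
    _Atomic size_t top;              /* thieves steal from this end */
    _Atomic size_t bottom;           /* owner pushes at this end */
    _Atomic(void *) slots[DEQUE_CAP];
} deque_t;

typedef struct {
    size_t top;   /* value of top observed by the thief */
    void *task;   /* pointer read optimistically, before the CAS */
} steal_attempt_t;

static void deque_init(deque_t *d) {
    atomic_init(&d->top, 0);
    atomic_init(&d->bottom, 0);
    for (size_t i = 0; i < DEQUE_CAP; i++) atomic_init(&d->slots[i], NULL);
}

static void deque_push(deque_t *d, void *task) {
    size_t b = atomic_load(&d->bottom);
    assert(b < DEQUE_CAP);
    atomic_store(&d->slots[b], task);
    atomic_store(&d->bottom, b + 1);
}

/* Step 1 of a steal: read the candidate task optimistically. */
static int steal_read(deque_t *d, steal_attempt_t *a) {
    size_t t = atomic_load(&d->top);
    size_t b = atomic_load(&d->bottom);
    if (t >= b) return 0;                 /* deque is empty */
    a->top = t;
    a->task = atomic_load(&d->slots[t]);
    return 1;
}

/* Step 2 of a steal: the CAS on top confirms it. Note that it only
 * checks the index, not whether the slot contents have changed. */
static int steal_commit(deque_t *d, const steal_attempt_t *a) {
    size_t expected = a->top;
    return atomic_compare_exchange_strong(&d->top, &expected, a->top + 1);
}

/* Stand-in for a local collection that moved the object and
 * forwarded the deque slot to its new address. */
static void relocate_slot(deque_t *d, size_t i, void *new_addr) {
    atomic_store(&d->slots[i], new_addr);
}

/* Replays: optimistic read, then relocation, then CAS.
 * Returns 1 if the steal succeeds while holding a stale pointer. */
static int replay_race(void) {
    int from_space = 42;
    int to_space = 0;
    deque_t d;
    steal_attempt_t a;

    deque_init(&d);
    deque_push(&d, &from_space);

    if (!steal_read(&d, &a)) return 0;     /* thief reads old address */
    to_space = from_space;                 /* collector copies the object */
    relocate_slot(&d, a.top, &to_space);   /* and updates the slot */
    if (!steal_commit(&d, &a)) return 0;   /* CAS still succeeds */

    return a.task == (void *)&from_space &&
           atomic_load(&d.slots[0]) == (void *)&to_space;
}
```

In this replay the CAS cannot detect the relocation because top is unchanged, which is why pinning the referenced object (so it cannot move) closes the window where re-reading the slot alone would not.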

In our implementation so far, we've been handling the work-stealing deque specially. Its updates are not subjected to the standard write barrier, because this would cause all down-pointers from the deque to stay live forever.

But, an interesting point: if I subject the deque to the standard write barrier, then the bug seems to go away. (At least, I haven't been able to trigger the bug in this case yet.)

So, the interesting challenge now is to figure out how to pin deque down-pointers while also allowing these to be unpinned appropriately at a later time. Our current unpin-depth trick won't work, because the deques live in the global heap (depth 0), and after scheduler initialization, the program will never again return to depth 0.

Proposal: we could use a hybrid remembered set strategy, delimited by depth. For objects x at depth 0 and down-pointers x[i] := y, we would use remembered set entries of the form (x,i,y), enabling us to later invalidate the entry when x[i] != y. For all other objects (at depth 1 or deeper), we would continue to use unpin depths.
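A rough sketch of how the depth-0 half of this proposal might look in C. Everything here is an assumption for illustration: object_t, rs_entry_t, strategy_for, and prune_entries are hypothetical names and layouts, not the code that was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object layout, for illustration only. */
typedef struct object {
    unsigned depth;            /* depth of the heap holding this object */
    size_t nfields;
    struct object **fields;
} object_t;

/* Remembered-set entry (x, i, y) recording the write x[i] := y. */
typedef struct {
    object_t *x;
    size_t i;
    object_t *y;
} rs_entry_t;

typedef enum { REMEMBER_BY_ENTRY, REMEMBER_BY_UNPIN_DEPTH } rs_strategy_t;

/* Hybrid choice, delimited by depth: entries at depth 0,
 * unpin depths everywhere else. */
static rs_strategy_t strategy_for(const object_t *x) {
    return x->depth == 0 ? REMEMBER_BY_ENTRY : REMEMBER_BY_UNPIN_DEPTH;
}

/* An entry stays valid only while x[i] still holds y. */
static bool entry_is_live(const rs_entry_t *e) {
    assert(e->i < e->x->nfields);
    return e->x->fields[e->i] == e->y;
}

/* Drops invalidated entries in place and returns the new count.
 * A real collector would also unpin each dropped y here. */
static size_t prune_entries(rs_entry_t *entries, size_t n) {
    size_t kept = 0;
    for (size_t k = 0; k < n; k++) {
        if (entry_is_live(&entries[k])) entries[kept++] = entries[k];
    }
    return kept;
}

/* Small demonstration: a depth-0 "deque" object with two down-pointers,
 * one of which is later overwritten. Returns 1 if exactly the stale
 * entry was dropped. */
static int demo_prune(void) {
    object_t y1 = { 1, 0, NULL };
    object_t y2 = { 1, 0, NULL };
    object_t *slots[2] = { &y1, &y2 };
    object_t x = { 0, 2, slots };
    rs_entry_t entries[2] = { { &x, 0, &y1 }, { &x, 1, &y2 } };

    if (strategy_for(&x) != REMEMBER_BY_ENTRY) return 0;
    if (strategy_for(&y1) != REMEMBER_BY_UNPIN_DEPTH) return 0;

    x.fields[0] = NULL;   /* e.g. the task in slot 0 was popped or stolen */

    size_t kept = prune_entries(entries, 2);
    return kept == 1 && entries[0].i == 1 && entries[0].y == &y2;
}
```

The point of storing the full triple is that staleness can be checked locally by comparing x[i] against y, so no return to depth 0 is needed to unpin a deque down-pointer.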

Sep 28 '22 14:09 shwestrick

This appears to be fixed by d1646cf, which was merged as part of #180.

Feb 19 '24 03:02 shwestrick