
GC bug on skyline benchmark?

Open shwestrick opened this issue 3 years ago • 2 comments

Skyline benchmark from mpllang/parallel-ml-bench.

The bug sometimes causes a segfault, although I've also seen it hang. It also appears to occur only on small core counts.

To reproduce:

$ cd parallel-ml-bench/mpl
$ make skyline.mpl.bin
$ bin/skyline.mpl.bin @mpl procs 4 -- -repeat 20

The bug is still present as of the current commit, b69ca194fc22ade32b571393f0433de77ab48ebb.

I think the bug is somewhere in CC. (If I disable forkGC in the scheduler, then the bug appears to go away. Or, at least, I haven't been able to trigger it after making that change.)

shwestrick avatar May 31 '22 14:05 shwestrick

Some notes:

  • Using @mpl max-cc-depth 1 -- (see PR #163) seems to make the bug go away. This limits CGC, making it very shallow, in particular by only allowing CGC on the root heap.
  • In gdb, I was able to trigger assertion failures in the debug build using a smaller input size (-size 100000).
  • The bug appears to be a dangling pointer originating from within the work-stealing deque.
  • Perhaps the work-stealing deque is not properly tracked by remembered sets in CGC?
    • On shwestrick/mpl/gc-debug, I tried additionally snapshotting the contents of the deque when CGC is spawned. But this didn't seem to change anything.
  • Perhaps LGC is forwarding an object reachable from the deque but not updating the down-pointer? (Then later CGC takes over, and witnesses a dangling down-pointer.)
    • TODO: double check that LGC handles objects reachable from the deque properly.

shwestrick avatar Aug 30 '22 15:08 shwestrick

Some possible progress on this.

I've discovered a race between LGC and scheduler steals. The problem is in the implementation of the ABP concurrent deque: on a steal, the stolen value is read optimistically, before the CAS that confirms the steal. Between the optimistic read and the CAS, a concurrent LGC could relocate the object, leaving the thief with a stale pointer even though its CAS succeeds. To fix this, I think all down-pointers from the work-stealing deque need to be pinned.
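
To make the interleaving concrete, here is a minimal single-threaded sketch of the race described above. It is a toy model, not the MPL runtime: ToyHeap, steal_during_lgc, and the integer "addresses" are all hypothetical names made up for illustration, and the concurrent LGC is simulated by running it between the thief's two steps.

```python
class ToyHeap:
    """Toy moving heap (hypothetical): integer addresses map to values."""

    def __init__(self):
        self.cells = {}
        self.pinned = set()
        self.next_addr = 100

    def alloc(self, value):
        addr = self.next_addr
        self.next_addr += 1
        self.cells[addr] = value
        return addr

    def relocate(self, addr):
        """Move an object to a fresh address unless it is pinned."""
        if addr in self.pinned:
            return addr
        return self.alloc(self.cells.pop(addr))


def steal_during_lgc(pin_deque_pointers):
    """Replay: optimistic read, then LGC, then CAS.

    Returns (cas_succeeded, stolen_pointer_still_valid).
    """
    heap = ToyHeap()
    slots = [heap.alloc("task")]  # deque contents: one down-pointer
    top = 0                       # index of the next element to steal

    if pin_deque_pointers:
        heap.pinned.update(slots)

    # Thief, step 1: optimistic read of the element at `top`.
    seen_top = top
    stolen = slots[seen_top]

    # Victim's LGC runs in between: it relocates objects reachable from
    # the deque and fixes up the slots, but not the thief's local copy.
    slots = [heap.relocate(addr) for addr in slots]

    # Thief, step 2: the CAS on `top` succeeds, because `top` never changed.
    cas_succeeded = top == seen_top
    if cas_succeeded:
        top = seen_top + 1

    return cas_succeeded, stolen in heap.cells


print(steal_during_lgc(pin_deque_pointers=False))  # (True, False)
print(steal_during_lgc(pin_deque_pointers=True))   # (True, True)
```

In the unpinned run the CAS succeeds but the thief's pointer is dangling, which matches the symptom above. Pinning keeps the object in place, so the optimistic read stays valid.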

In our implementation so far, we've been handling the work-stealing deque specially. Its updates are not subjected to the standard write barrier, because this would cause all down-pointers from the deque to stay live forever.

But, an interesting point: if I subject the deque to the standard write barrier, then the bug seems to go away. (At least, I haven't been able to trigger the bug in this case yet.)

So, the interesting challenge now is to figure out how to pin deque down-pointers while also allowing these to be unpinned appropriately at a later time. Our current unpin-depth trick won't work, because the deques live in the global heap (depth 0), and after scheduler initialization, the program will never again return to depth 0.

Proposal: we could use a hybrid remembered set strategy, delimited by depth. For objects x at depth 0 and down-pointers x[i] := y, we would use remembered set entries of the form (x,i,y), enabling us to later invalidate the entry when x[i] != y. For all other objects (at depth 1 or deeper), we would continue to use unpin depths.
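
A rough sketch of what this proposal could look like, as a toy model rather than actual runtime code. Obj, HybridRemSet, and the way unpin_depth is computed are assumptions for illustration only; the real unpin-depth logic is more involved.

```python
class Obj:
    """Toy heap object (hypothetical): a depth plus mutable pointer fields."""

    def __init__(self, depth, nfields=0):
        self.depth = depth
        self.fields = [None] * nfields
        self.unpin_depth = None  # only used when the source is at depth >= 1


class HybridRemSet:
    """Depth-0 sources get (x, i, y) entries; deeper sources use unpin depths."""

    def __init__(self):
        self.entries = []  # (x, i, y) triples, sources at depth 0 only

    def record_down_pointer(self, x, i, y):
        """Perform x[i] := y and remember the down-pointer."""
        x.fields[i] = y
        if x.depth == 0:
            self.entries.append((x, i, y))
        elif y.unpin_depth is None or x.depth < y.unpin_depth:
            # Simplified stand-in for the existing unpin-depth strategy.
            y.unpin_depth = x.depth

    def prune(self):
        """Drop entries that have been overwritten, i.e. x[i] != y."""
        self.entries = [(x, i, y) for (x, i, y) in self.entries
                        if x.fields[i] is y]

    def is_pinned(self, y):
        self.prune()
        held_by_root = any(target is y for (_, _, target) in self.entries)
        return held_by_root or y.unpin_depth is not None


remset = HybridRemSet()
deque = Obj(depth=0, nfields=1)  # stands in for a work-stealing deque
task = Obj(depth=2)

remset.record_down_pointer(deque, 0, task)
print(remset.is_pinned(task))  # True

deque.fields[0] = None         # the slot is later cleared or overwritten
print(remset.is_pinned(task))  # False
```

The point of keeping y in the entry is that staleness can be detected by re-reading x[i], so the pin does not depend on the program ever returning to depth 0.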

shwestrick avatar Sep 28 '22 14:09 shwestrick

This appears to be fixed! The fix is in d1646cf, which was merged as part of #180.

shwestrick avatar Feb 19 '24 03:02 shwestrick