ocaml icon indicating copy to clipboard operation
ocaml copied to clipboard

runtime/minor_gc.c: update young-pointers after leaving the barrier

Open gasche opened this issue 4 months ago • 5 comments

This is another iteration after #14305 failed. The idea is to try to assume that the memprof state may be mutated by the GC, and still ensure that the implementation behaves correctly when that occurs.

In #14305 I tried to make caml_memprof_set_trigger safe to call during the minor GC critical section, but this approach is doomed to fail. In this new PR, I move the place where caml_memprof_set_trigger is called during the minor GC, to after the domain has left the GC critical section, so we know that no other domains are running GC code anymore.


The implementation of caml_memprof_set_trigger accesses memprof internal state that is stored on the GC-ed heap. This cannot be called safely while other domains may be mutating the heap during minor collections. (This results in data races, see #14300).

To avoid this issue, we move the update of the domain's young-pointer to after leaving the minor GC barrier. At this point we know that all domains are done collecting the heap.

gasche avatar Oct 18 '25 09:10 gasche

My intuition is that this change is more principled/robust than #14304 (which is arguably a hack to avoid a known-bad interaction with the GC, rather than a guarantee that the issue is solved), but also more invasive for the rest of the runtime, so it requires a more careful review from people with GC knowledge. ( @NickBarnes would of course be a good reviewer for both, being both a memprof and a GC expert. )

(In particular I find it tempting to consider merging #14304 relatively quickly, to unblock the TSan CI, but then also merge the current PR if we believe it is correct, to solve the issue long-term.)

Note: this gives me the impression that storing data used by the runtime as structured data on the OCaml heap is a rather delicate affair. It may be more robust to store this data off-heap, with a proxy value on the OCaml heap that is exposed to users.

gasche avatar Oct 18 '25 09:10 gasche

This looks good to me, indeed.

One "subtle" point is that the major GC is called to do opportunistic work, with a minor heap in an inconsistent state (since the young pointers are not valid). But it seems indeed valid that the major GC will never access these state fields of the minor heap.

jhjourdan avatar Oct 23 '25 09:10 jhjourdan

For the record, keeping statmemprof configuration information on the C heap rather than the Caml heap is tricky, because of its shared and atomic nature: several domains may share a "profile" (an abstract value which is in fact the configuration record), and changes to the profile (such as Gc.Memprof.stop) should atomically affect all such domains. If we wanted to keep (say) the "status" value in a block on the C heap, with pointers from the per-domain or per-thread runtime memprof structures, then that's possible but not trivial (e.g. would we have a finalizer on some Caml heap block to free the C heap block?).

NickBarnes avatar Nov 26 '25 16:11 NickBarnes

This looks right to me. The trigger value isn't needed until this domain is allocating again, so it's safe to defer that until after we've left the barrier, and doing it any sooner than that is risky because some other domain might be touching the profile (stored on the heap): the improvements in #14304 and #14384 make that less risky but don't change the principle that on-heap data shouldn't be touched then. Sharing the previously-duplicated code to set the various "young pointers" is also a clear improvement.

NickBarnes avatar Nov 28 '25 10:11 NickBarnes

Thanks for the review (and also @jhjourdan!). I will try to please the CI, add an extra comment I felt was missing on re-review (clarify that we intentionally reset the trigger after the STW critical section and not during), and I think that I will proceed to merge.

In the last few weeks I was of the opinion that a statmemprof-side fix to store the relevant bits of configuration outside the heap might a simpler overall fix, but the experiments by @stedolan suggest that this is not in fact the case, so I'm happy to decide that the present PR is in fact a reasonable way to proceed long-term.

gasche avatar Nov 28 '25 10:11 gasche