mmtk-core icon indicating copy to clipboard operation
mmtk-core copied to clipboard

`is_collection_enabled` is not thread-safe

Open qinsoon opened this issue 3 months ago • 7 comments

https://github.com/mmtk/mmtk-core/pull/1075 removed disable/enable_collection and replaced it with Collection::is_collection_enabled. However, it seems a binding cannot implement enable/disable collection in a thread-safe way with the latter API.

Consider a scenario like this:

  1. GC is enabled by default.
  2. Thread 1 tries to allocate and polls for GC. The poll is allowed, thus MMTk schedules work packets for the GC. However, none of the work packets has been executed yet (including StopMutators and Collection::stop_all_mutators).
  3. Thread 2 tries to disable GC. It sets a binding-side flag, and future Collection::is_collection_enabled will return this flag value. Thread 2 further invokes a GC safe point to make sure there is no ongoing GC request yet. As stop_all_mutators has not been called yet, the binding does not know there is a GC and the safe point simply returns.
  4. Thread 2 starts to execute code that is supposed to run only when GC is disabled.
  5. StopMutators gets executed by MMTk, and it asks the binding to stop mutators.
  6. Thread 2 runs to a GC safe point, and gets stopped at a place where GC is not supposed to happen.

Though the document for is_collection_enabled states that 'any synchronization involving enabling and disabling collections by mutator threads should be implemented by the VM', there seems to be no way for the VM to do this.

The core issue is that there exists a window between when MMTk polls for GC and when it executes stop_all_mutators. During this window:

  • MMTk already considers a GC in progress.
  • The binding, however, cannot detect that GC has been triggered.

If the binding disables GC in that interval, the two states diverge -- leading to GC happening when the VM assumes it cannot.

The previous enable/disable_collection API did not have this problem. When the binding called these methods, MMTk could internally check whether a GC was ongoing and block the call until the GC completed, ensuring proper synchronization.

qinsoon avatar Oct 07 '25 04:10 qinsoon

The other option is to ask the binding to call memory_manager::gc_poll as soon as the binding disables the GC, and we need an option/flag for gc_poll to always poll (ignoring whether GC is disabled or not). Otherwise once the binding sets GC as disabled, gc_poll will not poll for GC.

qinsoon avatar Oct 07 '25 23:10 qinsoon

I think the "is_collection_enabled" thing is usually sloppy. For some programming languages, it doesn't really matter whether it must be absolutely forbidden to trigger GC. In CRuby, for example, GC::start forces a GC even when GC::disable has been called.

I think the scenario @qinsoon described involves two issues:

  1. Whether MMTk core thinks it should trigger GC, and
  2. whether a particular mutator thread is willing to stop for GC

Per-thread handshake

I think the "Thread 2" in the example wants a kind of "critical section" semantics, inside which the GC should not block Thread 2. If Thread 2 doesn't allocate during the critical section, and does not wait for other mutator threads (other threads may not be in critical sections and may willingly block for GC), the simplest solution will be using per-thread handshake instead of global stop-the-world flag. This means Thread 1 can still trigger GC, and MMTk core can still call stop_mutators. But stop_mutators will try to stop each mutator separately. Particularly, Thread 2 will not block for GC until it exits critical section.

Atomic GC-triggering API

If per-thread handshake is impossible, and therefore there must be a global decision of whether GC should be triggered or GC should be disabled, MMTk-core will need to provide a new set of APIs. There will be a VM-specific state on the binding side which basically has three states: GCIdle, GCTriggered and GCDisabled (not visible to MMTk-core). It can be set atomically using the following APIs:

  • Collection::trigger_gc(): Atomically set the state from GCIdle to GCTriggered. If successful, return, and MMTk core can start GC. Otherwise block until it is possible to trigger GC.
  • Collection::try_trigger_gc() -> Result<(), ()>: A non-blocking version of trigger_gc(). Return Err(()) immediately if cannot trigger GC now.
  • Collection::clear_gc_trigger_state(): Resets the state to GCIdle, and unblock blocking threads (may be VM-specific threads).

And the VM binding can atomically set the state to GCDisabled, in which case Collection::trigger_gc will block, and Collection::try_trigger_gc() will return Err.

MMTk core will call Collection::trigger_gc() or ``Collection::try_trigger_gc()right before callingself.gc_requester.request();(note: not before callingpoll. This means is_collection_enabledis not deprecated.). It will lettrigger_gc` race with the VM binding, and arbitrate between triggering GC and disabling GC.

We leave the state as VM-specific because the VM may represent the GCDisabled state with a counter GCDisabled { count: usize }. It may count the number of mutators currently in critical section.

Issues to consider

Should the state be implemented in mmtk-core or the VM binding? Implementing it in the binding definitely gives the VM binding more freedom. But if we can move it to mmtk-core and still be able to handle multiple mutators entering/leaving critical sections independently, it may simplify the implementation of bindings because the default can be always allowing MMTk-core to trigger GC, and the VM binding doesn't need to transition the state to GCDisabled unless it has any kind of critical section semantics.

Can we merge Collection::trigger_gc with Collection::is_collection_enabled? Currently, is_collection_enabled is called before calling poll, but Collection::trigger_gc must be called after poll because it is a decision-making function. They are, in theory, two different things. is_collection_enabled is a hint, while trigger_gc is for arbitration.

wks avatar Oct 09 '25 03:10 wks

I think the "Thread 2" in the https://github.com/mmtk/mmtk-core/issues/1398#issue-3489912596 wants a kind of "critical section" semantics

For Julia, though it looks like a 'critical section', it is actually a global state: https://github.com/JuliaLang/julia/blob/eb4006be44fd6bfb0f01d59672a54ac48a24407f/src/gc-common.c#L659. So we have to comply with it and allow GC to be disabled globally.

MMTk-core will need to provide a new set of APIs.

This is not necessarily true. https://github.com/mmtk/mmtk-core/issues/1398#issuecomment-3379049326 would be the minimal change that we need to make it work -- we just need MMTk to provide a safepoint call so the binding can check if MMTk has triggered a GC.

qinsoon avatar Oct 09 '25 04:10 qinsoon

I think the "Thread 2" in the #1398 (comment) wants a kind of "critical section" semantics

For Julia, though it looks like a 'critical section', it is actually a global state: https://github.com/JuliaLang/julia/blob/eb4006be44fd6bfb0f01d59672a54ac48a24407f/src/gc-common.c#L659. So we have to comply with it and allow GC to be disabled globally.

That's interesting. If it is a global flag in Julia, and Julia supports true multi-thread concurrency (no GIL), then Julia must have faced similar issues. What will happen in Julia's own GC if Thread 1 triggers GC and Thread 2 disables GC after that but before reaching the next safepoint?

MMTk-core will need to provide a new set of APIs.

This is not necessarily true. #1398 (comment) would be the minimal change that we need to make it work -- we just need MMTk to provide a safepoint call so the binding can check if MMTk has triggered a GC.

This has the TOCTTOU problem. If Thread 1 triggers GC between the window after Thread 2 seeing "GC is not triggered" but before Thread 2 disabling GC, the result will be the same as Thread 2 disabling GC without checking whether GC is triggered at all. Concretely,

  1. In the beginning, bool gc_triggered = false;
  2. Thread 2 checks gc_triggered and sees false. Then the OS suspended Thread 2.
  3. Thread 1 allocates and triggers GC. It sets gc_triggered to true.
  4. Thread 2 wakes up. Now gc_triggered == true, but Thread 2 thinks it has just checked that gc_triggered == false. Thread 2 then disables GC.
  5. A GC worker thread calls stop_mutators
  6. Thread 2 hits a safepoint and stopped, but Thread 2 thinks GC has been disabled, and Thread 2 shouldn't block for GC.

If the decision must be made consistently between threads, mere checking won't work. We need atomic read-modify-write operations to arbitrate between two conflicting decisions (can be implemented with a lock or atomic RMW memory access).

The other option is to ask the binding to call memory_manager::gc_poll as soon as the binding disables the GC, and we need an option/flag for gc_poll to always poll (ignoring whether GC is disabled or not). Otherwise once the binding sets GC as disabled, gc_poll will not poll for GC.

If we make gc_poll always call mmtk.gc_trigger.poll, but do not make other changes, then mmtk.gc_trigger.poll will always call self.gc_requester.request() if the GCTrigger considers it good time to trigger GC. It still races with the operation that disables GC. As I mentioned above, there is the TOCTTOU problem.

wks avatar Oct 09 '25 06:10 wks

A thread would have to disable GC first, then invoke a safepoint check as here. However, that safepoint check cannot be the runtime/language side check, and has to be a check with MMTk (like gc_poll).

qinsoon avatar Oct 09 '25 06:10 qinsoon

A thread would have to disable GC first, then invoke a safepoint check as here. However, that safepoint check cannot be the runtime/language side check, and has to be a check with MMTk (like gc_poll).

It will still break if Thread 1 and Thread 2 interleaves differently:

  1. In the beginning, bool gc_triggered = false; and bool gc_enabled = true;.
  2. Thread 1 sees gc_enabled == true and is just about to trigger GC. The OS context-switched.
  3. Thread 2 disables GC, setting gc_enabled = false.
  4. Thread 2 checks if GC is triggered. It sees gc_triggered = false and continues.
  5. Thread 1 wakes up and triggers GC, setting gc_triggered = true.
  6. A GC worker thread calls stop_mutators.
  7. Thread 2 is stopped at a safepoint, although it thought gc_triggered were false.

wks avatar Oct 09 '25 06:10 wks

Right. So the check for gc_enabled and the update for gc_triggered needs to happen atomically. If these two states sit separately in MMTk core and the binding, it is impossible to perform this operation atomically. Maybe we can just provide a thread-safe version of enable/disable_collection, and handle the thread safety inside MMTk.

qinsoon avatar Oct 09 '25 23:10 qinsoon