ocaml runtime: distinguish two kind of mutexes

Summary

This PR distinguishes two kind of mutexes:

'runtime' mutexes are for blocking critical sections which do not access the runtime.
'mutator' mutexes are for non-blocking critical sections (blocking on a mutex releases the runtime lock) which may access the runtime system.

This refactoring comes from the discussions in #13227, it tries to avoid a class of bug where the same mutex is used in both blocking and non-blocking fashion, resulting in subtle deadlock situations.

More details

The runtime has a caml_plat_lock_blocking function that takes a mutex in the obvious way. This function should be used very carefully, because it in blocks a domain without transferring control to its backup thread or otherwise listening to STW interruptions, and it can easily cause deadlocks if the critical section itself contains an STW poll point. In #13063, @gadmm introduced a different mutex-taking function, caml_plat_lock_non_blocking that releases the domain lock when it needs to block, and should be used in any critical section that could be long or needs to use the runtime system.

We are still learning about what's a correct usage discipline for these two functions (on Monday I temporarily introduced a bug in trunk, detected reported on Tuesday morning by @jmid in #13713 and fixed on Tuesday evening by @gadmm in #13714 ). In #13714 we realized that it is incorrect to mix uses of lock_blocking and lock_non_blocking on the same mutex -- except in very specific use-cases that are not currently used in the runtime. The current PR proposes to separate the two APIs so that there is no risk to make this mistake again in the future. The goal is to have a system that is simpler to reason about and to use correctly for non-experts such as myself.

Jan 08 '25 09:01 gasche

(I intend to enable multicoretests CI on this PR -- it should only be a refactoring that does not change the runtime behavior, but I wrote new code for mutator mutexes instead of just moving code around, so there may be mistakes. But I will wait first for the usual CI to be green, I have only tested it lightly locally and there is no need to unleash automated testing yet.)

Jan 08 '25 10:01 gasche

Note: I am afraid there is some overlap and potential conflicts between this PR and #13416, as the PR exploits the definition of sync_mutex in sync_posix.h as a pointer to a caml_plat_mutex. There is not much I can do in particular about this (I don't have the expertise to review #13416 to help it move forward), but I tried to not touch sync_posix.h to minimize conflicts.

Jan 08 '25 10:01 gasche

use correctly for non-experts such as myself

Speaking as a non-expert myself, I often have to remind myself that blocking/non-blocking in this context refers to the domain lock and not the mutex itself (blocking on the mutex until it's released). What I mean is that caml_plat_lock_non_blocking will wait on the mutex and is not a wrapper of, e.g., pthread_mutex_trylock. It's sort of similar to caml_enter_blocking_section and caml_leave_blocking_section. Maybe there's a case for renaming the lock functions to something like caml_lock_release_runtime_system and caml_lock_acquire_runtime_system?

Jan 10 '25 13:01 MisterDA

I see your point, but acquire_runtime_system does not work as it suggests taking the domain lock, while we expect it to be held and keep it.

(I suggested "cooperative" (and we could use "non_cooperative") before but @gadmm was not convinced.)

Maybe caml_plat_lock_blocking and just caml_mutex_lock_polling could work? The problem with caml_plat_lock is that it really is blocking on the mutex, and therefore it blocks the domain; it's nice that the name calls attention to this problem and makes people think twice about using it. On the other hand caml_mutex_lock{,_polling} is a somewhat standard mutator function, which is unsafe in contexts that don't own the runtime system but this is a safety property common with other functions of the runtime, so it may be handled in more standard ways that don't require a specific naming convention.

(It may be that some lightweight static annotations could help detect issues where mutator functions are called unsafely; there is already a dynamic debug-time system with CAMLnoalloc and CAMLalloc_point_here that I'm not familiar with and therefore do not use.)

Jan 10 '25 14:01 gasche

Just a quick comment without reading the code in-depth, the "non blocking" variants of plat functions are also "non raising" variants of the caml_ml_mutex_* functions from sync.c, that can be used wherever you cannot reasonably handle an exception. So an option would be "caml_mutex_lock_noexc" and "caml_plat_lock_blocking".

I do not see what is the use for the current caml_mutex_lock. This one seems like it would easily be misused and cause deadlocks. Before removing it, though, are we sure no-one uses it?

Jan 14 '25 18:01 gadmm

I am preparing to discuss this in today's developer meeting, and I tried to remember the details of what is incorrect and why. In terms of the current API in trunk:

caml_plat_lock_blocking is a dangerous operation because it blocks the domain, in particular it will not receive STW notifications. The OCaml runtime relies on the idea that each domain will regularly poll for STW notifications, otherwise everyone gets blocked. So caml_plat_lock_blocking can only be used for very short critical sections -- and we are in trouble if the OS deschedules the thread in the middle.
In particular, calling any operation that can start a STW section in the middle of a caml_plat_lock_blocking section is incorrect. Indeed, if two OCaml domains get into this part of the code at the same time, there can be a deadlock where domain 1 holds the lock and starts a STW section, and domain 2 is waiting for the lock and thus does not poll for STW.
caml_plat_lock_non_blocking is a variant of caml_plat_lock_blocking which releases the domain lock before locking, so some other domain thread (possibly the backup thread) will keep polling for STW interrupts and everything is fine. On the other hand, releasing the domain lock means that arbitrary mutator code can run at this point, and the C code around has to be safe against this. (In particular it better register its OCaml values as local roots!)
It is incorrect in general to use both _blocking and _non_blocking lock-taking operations on the same mutex. The problem is that this creates a lock inversion issue that can result in a deadlock: we typically call lock_blocking(m) while holding the domain lock, while lock_non_blocking(m) will release the domain lock, take m and take the domain lock again. This can result in a lock-ordering-inversion deadlock: thread 1 starts a lock_non_blocking(m) by releasing its domain lock D and taking m, thread 2 gets the domain lock D and starts a lock_blocking(m) call, now thread 1 has m and is waitinig for D and thread 2 has D and is waiting for m.

From just 1-2-3 alone our idea was that whether to use lock_blocking or lock_non_blocking could be decided independently for each critical section, giving a preference to lock_non_blocking which is typically easier to reason about (one needs the same precaution as when calling an arbitrary OCaml mutator, but no worries about mysterious deadlocks). In fact from 4 we realized that it must be global per-lock decision to use each lock in a blocking or in a non-blocking way, hence the suggestion in this PR to separate the two different types.

Mar 27 '25 08:03 gasche

Perhaps Clang's Thread Safety Analysis static annotations could help and ensure each access is correctly paired with the right kind of mutex.

Mar 27 '25 10:03 MisterDA

@gasche No, 4. was well understood and documented and part of the reasons why I started fixing the runtime code. I thought about having two types of mutexes at the time but the single type of mutex was retained because it gets more interesting if you consider the distinction between STW and non-STW phases. For data structures that are accessed during both in STW and from the mutator, you might want to use the blocking one in STW and the non-blocking from the mutator. That's what the discussion in #13227 was about. But now that I read the discussion again I don't understand what the objection was anymore. It involves custom finalizers, but those are a third category (you cannot call non_blocking yet you are not inside STW).

P.S.: Also, it was definitely not our idea that whether to use blocking or non_blocking could be decided independently for each critical section. All this is clear from the documentation of the functions.

Mar 27 '25 11:03 gadmm

Using lock_non_blocking in the mutator together with lock_blocking in the STW indeed cause deadlocks, here is an example that does not involve custom finalisers and is quite simple:

A thread on domain 0 inside lock_non_blocking on the mutator succesfully acquires M and waits for its domain lock.
A STW is requested, and another systhread on domain 0 or the backup thread serves the request.
One of the STW participants tries to acquire M.

Thus the presence of a STW does not change the situation much compared to usual lock inversion.

There was also another, valid situation where a mutex is acquired with lock_blocking outside of a domain. This is also possible in your PR using the non-blocking variant thanks to testing Caml_state_opt before releasing the lock.

So it is quite natural to distinguish two types of mutexes.

Mar 27 '25 11:03 gadmm

I rebased this PR with two changes from the reviews:

I use caml_enter_blocking_section_no_pending to fix an issue spotted by @gadmm
after a naming discussion with @MisterDA on the fact that caml_mutex_lock_non_blocking is confusing (it is!), I settled for the long and explicit caml_mutex_lock_while_yielding_the_runtime_system; at least it is clear, I hope (caml_plat_lock_blocking remains unchanged)

May 18 '25 17:05 gasche

I am going to give a contrasting opinion for the name caml_mutex_lock_while_yielding_the_runtime_system. This name makes it sound as if there was another way the mutex could be locked that did not "yield" (which I do not think is a terminology used elsewhere by the way). We went all this length to distinguish the blocking and non-blocking behaviour with the C type of the mutex. There is no other sensible behaviour for this new type of mutex than to release the domain lock and so there is no ambiguity here.

In addition, the name still fails to mention the fact that the blocking section is non-polling and no exception arises in case of error, which is the point of this function.

A simpler and more accurate name is therefore: caml_mutex_lock_noexc.

Jul 03 '25 17:07 gadmm