rust icon indicating copy to clipboard operation
rust copied to clipboard

Tracking issue for improving std::sync::{Mutex, RwLock, Condvar}

Open m-ou-se opened this issue 2 years ago • 90 comments

A few weeks ago, on January 12th, the @rust-lang/libs team discussed a plan to improve std::sync::Mutex, std::sync::Condvar, and std::sync::RwLock.

On most platforms, these structures are currently wrappers around their pthread equivalent, such as pthread_mutex_t. These types are not movable, however, forcing us to wrap them in a Box, resulting in an allocation and indirection for our lock types. This also gets in the way of a const constructor for these types, which makes static locks more complicated than necessary.

@Amanieu presented his library, parking_lot, which implements the 'parking lot' structure from WebKit. Rather than letting the operating system handle any waiting queues, this manages the queues in user space in a global hash table indexed by the address of the lock. This allows for flexible 1-byte mutexes with almost no restrictions.

There has been an attempt at merging parking_lot into std, as a replacement for the implementation of std's locking primitives. However, many different blockers came up and that PR did not get merged. (See also my RustConf talk for some of this history.) Many of these blockers have been resolved since.

One of the problems with replacing std's lock implementations by parking_lot is that parking_lot allocates memory for its global hash table. A Rust program can define its own custom allocator, and such a custom allocator will likely use the standard library's locks, creating a cyclic dependency problem where you can't allocate memory without locking, but you can't lock without first allocating the hash table.

A possible solution is to use a static table with a fixed number of entries, which does not scale with the number of threads. This could work well in many situations, but can become problematic in highly parallel programs and systems. Another possible solution is to skip the global allocator and directly allocate this hash table with a native system API such as mmap on Unix. (See this thread for some discussion.)

In our meeting, we discussed what's natively available on various operating systems/platforms:

  • On Windows, we have already switched to Windows SRW locks, which do not require allocation as they can be moved. They are small (32-bit), efficient, and const constructible. Just for Windows, there does not seem much advantage to switching to parking_lot, although there might be some (small) performance difference.

  • On both Linux and some BSDs there's a native futex() syscall, which provides functionality similar (but more primitive) than a parking lot. Nearly equivalent functionality is available in Wasm. While not trivial, it's possible to implement our synchronization primitives directly based on such a futex. This makes all the primitives small (32-bit), efficient, and const constructible.

  • On macOS and some other Unixes there doesn't seem to be anything that's similar futex syscall. (parking_lot uses pthread for its thread parking operations on those platforms.)

  • Some smaller platforms that are (partially) supported by the standard library have their own non-trivial implementation of the standard locks, which are hard to maintain and have not gotten much validation.

After some discussion, the consensus was to providing the locks as 'thinnest possible wrapper' around the native lock APIs as long as they are still small, efficient, and const constructible. This means SRW locks on Windows, and futex-based locks on Linux, some BSDs, and Wasm. And for other platforms without suitable native APIs, a parking_lot-based implementation using one of the possible workarounds for the allocation problem.

This means that on platforms like Linux and Windows, the operating system will be responsible for managing the waiting queues of the locks, such that any kernel improvements and features like debugging facilities in this area are directly available for Rust programs.

It also means that porting the standard library to a new platform will be easier, and maintainance of the std implementation for the less common platforms will become easier. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.

Update: We've decided to not directly use parking lot's one-byte (two bit) locks, but instead use the equivalent of its internal WordLock or Windows' SRW locks, which are one pointer in size and require no global state in the program. That solves the allocation problem.

Update 2: To maintain priority inheritance of the native locks (e.g. on macOS), we've kept using pthread locks on non-futex unix platforms, at least for now. Using https://github.com/rust-lang/rust/pull/97647, we've made it possible for the locks to have const fn new.


Primary goals:

  • [x] Efficient non-allocating locks
    • [x] Windows
    • [x] Linux
    • [ ] ~~macOS~~
      • We can't avoid pthread if we want to keep features of the OS' native locks, like priority iniheritance.
    • [x] {Free, Open}BSD
  • [x] Turn std::sync::{Mutex, Condvar, RwLock}::new into const functions: https://github.com/rust-lang/rust/pull/97791
  • [x] Resolve some bugs around pthreads
    • [x] https://github.com/rust-lang/rust/issues/85434
    • [x] https://github.com/rust-lang/rust/issues/94564#issuecomment-1098942913

Possible later goals:

  • Add a Condvar::wait_rwlock to make Condvar usable with RwLocks?
  • Allow Sending MutexGuards to other threads?

To do:

  • [x] Relax Condvar requirements to allow for unboxed mutexes. https://github.com/rust-lang/rust/pull/76932
  • [x] Use unboxed SRW locks on Windows.
    • [x] Make Microsoft promise that SRW locks are indeed movable. https://github.com/MicrosoftDocs/sdk-api/pull/447
    • [x] https://github.com/rust-lang/rust/pull/76645
    • [x] Refactor sys_common::Mutex to have a separate MovableMutex type to allow unboxing on some platforms. https://github.com/rust-lang/rust/pull/77147
    • [x] Remove the Boxes in Mutex and Condvar on Windows and Wasm. https://github.com/rust-lang/rust/pull/77380
    • [x] Remove the Box from RwLock on Windows and Wasm. https://github.com/rust-lang/rust/pull/84687
    • [x] (Start using WaitOnAddress and NtWaitForKeyedEvent in the standard library.) https://github.com/rust-lang/rust/pull/77618
      • This unblocks usage of these APIs in parking_lot if we want to use that on Windows too. But if we just use SRW locks, this was not necessary to unblock the lock improvements.
  • [x] Use futex-based locks on Linux.
    • [x] Start using the futex syscall to get some experience with it in the standard library. https://github.com/rust-lang/rust/pull/76919
    • [x] Implement the futex() syscall in Miri to allow Miri to run standard library tests. https://github.com/rust-lang/miri/pull/1568
      • [x] Also implement the bitset futex operations in Miri. https://github.com/rust-lang/miri/pull/2054
    • [x] Implement Mutex and Condvar using futex().
      • [x] Design these.
        • [x] Experimentation: https://github.com/m-ou-se/futex-lock-experiment/
        • [x] Investigate other implementations (glibc, musl, Windows, etc.)
          • [x] musl: https://github.com/rust-lang/rust/issues/93740#issuecomment-1041391284
          • [x] glibc
            • [x] Mutexes: https://github.com/rust-lang/rust/issues/93740#issuecomment-1041572651
            • [x] Condition variables: https://github.com/rust-lang/rust/issues/93740#issuecomment-1048886597
            • [x] Reader-writer locks: https://github.com/rust-lang/rust/issues/93740#issuecomment-1055354913
          • [x] Boost: https://github.com/rust-lang/rust/issues/93740#issuecomment-1064286563
          • [x] Windows' SRW locks: https://github.com/rust-lang/rust/issues/93740#issuecomment-1064139337
          • [x] Wine's SRW locks: https://github.com/rust-lang/rust/issues/93740#issuecomment-1055672041
          • [x] Apple Darwin libpthread: https://github.com/rust-lang/rust/issues/93740#issuecomment-1069240178
          • [x] FreeBSD's libpthread: https://github.com/rust-lang/rust/issues/93740#issuecomment-1069265943
        • [x] Make some design decisions: https://github.com/rust-lang/rust/issues/93740#issuecomment-1070696128
      • [x] Implementation: https://github.com/rust-lang/rust/pull/95035
    • [x] Implement ReentrantMutex using futex(): https://github.com/rust-lang/rust/pull/95727
    • [x] Implement RwLock using futex(): https://github.com/rust-lang/rust/pull/95801
  • [x] Use the same ReentrantMutex on all platforms: https://github.com/rust-lang/rust/pull/96042
  • [x] Use futex-based locks on *BSD.
    • [x] Add NetBSD's FUTEX_* constants to the libc crate. https://github.com/rust-lang/libc/pull/2762
    • [x] Add OpenBSD's futex() to the libc crate. https://github.com/rust-lang/libc/pull/2761
    • [x] Add FreeBSD umtx functions and constants to the libc crate. https://github.com/rust-lang/libc/pull/2770
    • [x] Add DragonFlyBSD's umtx_* functions to the libc crate. https://github.com/rust-lang/libc/pull/2763
    • [x] Switch *BSD over to the same futex locks as on Linux. https://github.com/rust-lang/rust/pull/96510
  • [x] Use futex based locks on all other platforms that support a futex-like API.
    • [x] Use futex based locks on Emscripten. https://github.com/rust-lang/rust/pull/96205
    • [x] Use atomic.wait/atomic.notify based locks on Wasm. https://github.com/rust-lang/rust/pull/96206
    • [ ] Use zx_futex_wait based locks on Fuchsia.
      • (Tier 2 platform.)
    • [ ] Use syscall::call::futex based locks on Redox.
      • (Tier 3 platform.)
    • [x] Use futexes on Hermit. https://github.com/rust-lang/rust/pull/101475
  • [x] Lazily allocate on platforms where we still use pthread: https://github.com/rust-lang/rust/pull/97647
    • [x] macOS
    • [x] NetBSD, other unix
  • [x] Make the new functions const. https://github.com/rust-lang/rust/pull/97791
  • [ ] Use 'WordLock' (aka SRW lock, aka 'user space linked-list queue lock') on other platforms.
    • [x] Remove locks from Instant::now(), to avoid a cyclic dependency.
    • [x] Stop using std::sync::{Mutex, Condvar} in std::sys's thread parker, to avoid a cyclic dependency. https://github.com/rust-lang/rust/pull/96393
    • [ ] Implement this type of locks, and replace the lock implementations of non-futex platforms by it.
  • [x] Mark this issue as fixed: https://github.com/rust-lang/rust/issues/85434
  • [ ] Write some blog post about all this.
  • [ ] Celebrate. :tada:

m-ou-se avatar Feb 07 '22 16:02 m-ou-se

~~Did you investigate using os_unfair_lock (Apple Developer Docs) on macOS? It can't be moved once it is in use either but it is essentially just a struct with a u32 in it, initialized to 0 before use. But maybe I am missing some requirements here. If it seems possible, investigation of it could be added as a subtask.~~

~~It is available since macOS 10.12 on x86-64, but could be used unconditionally on aarch64.~~

Nevermind it does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.

hkratz avatar Feb 07 '22 17:02 hkratz

I don't think we need to use hash tables like parking lot for targets without futex support. We could just store the queues inside Mutex. I think the only downside is size, but that'll still be better than the current boxed pthread mutex though.

nbdd0121 avatar Feb 07 '22 18:02 nbdd0121

Nevermind it does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.

Also it doesn't support Condvar, which is required for Rust locks.

Amanieu avatar Feb 07 '22 18:02 Amanieu

Oh. This is a subject extremely near-and-dear to my heart -- I even started on a pre-RFC for atomic wait/wake ops, like the ones that C++ got recently. That said, much of my motivation was hoping it would find a way to unstick us here, and I this is clearly a higher priority.

For futex APIs, I wrote a blog post on them at one point. https://shift.click/blog/futex-like-apis I wrote about some of the options here for system APIs, which might be helpful as reference.

That said, I got asked about Darwin's __ulock_* API in a Zulip DM, since I wrote that post as well as hte ulock-sys crate.


For Darwin, it I think it's quite unlikely we could use __ulock_{wait,wake} for a futex API directly without risking... many problems. The fact that it may break in the future is actually one of the less concerning issues, in comparison to the fact that that use of private APIs can cause rejection from app stores, which would be a problem for Rust code that on the iOS/Mac App Stores (for clarity: I'm just going on a hunch[^stores] that could happen for __ulock_*, I do not know for certain, but I think libstd needs to favor caution here).

That said, there's... a pretty cursed option for a workaround on these targets. Too cursed to use? Not sure... possibly.

That is: libc++'s implementation of std::atomic<u32>::wait, notify_one, and notify_all will ultimately bottom out to calls to __ulock_wait/__ulock_wake, which... I guess they get to use because libc++ gets shipped and updated with the OS on Darwin, fair enough. Anyway, in theory we could just call that somehow. And there's a sense in which this would be pretty nice, since it's a system-provided function that implements (something similar to) the API we want... But it's certainly not ideal for a few reasons:

  1. It's unclear to me how much we promise about what we link to on various platforms, but it's possible we cannot do this, and likely that we do not want to... I don't know if we make promises about what we need to link against for libstd, but suddenly dynamically linking against the C++ standard library on some platforms... feels like it would be a change.

  2. Not to mention -- even if we are allowed to do this, it's still a pain in the butt: The version of libc++ that provides this this is much newer than our platform guarantees, so we'd need to weak-import[^dl] the dubiously-public[^abi] symbol that this stuff under the hood (although I suppose this is better than requiring a C++ compiler to build Rust code for these platforms).

So, maybe? Given that the OS provides libc++, it seems like it should technically be fine (unless we specifically say we won't). At least, so long as we manage to do it in a way that doesn't versions older than we care about. But maybe I've missed something (or I haven't, but it's still several bridges too far).

(If not, oh well, someday an API shaped mostly like that is bound to end up as part of the libc someday, after it's copied to C from C++ in C9Z 😛)

(EDIT: Thinking about it more, it's fairly likely we should have some justification for this approach, since looking at the libc++ code, enough happens between what our all would be that it could defeat the point. And the effort spent here to make it work would be better spent on improving our fallback)


All that said, there are a few options on various platforms that could probably help implement the fallback thread parker, in case there are regressions (which plausibly there wouldn't be if we use parking_lot_core as-is, but maybe there would be if we need to abandon the growth behavior). In particular:

  • A few of the futexless unices (NetBSD, the Solaris-likes, perhaps others?) have a lwp_park and lwp_unpark functions, which is basically just an OS primitive for thread parking and unparking.

  • On Darwin (yes, yes, more Darwin), I think we could probably use libdispatch's dispatch_semaphore instead, as it's been the target of a lot of optimization -- probably more than the currently-active verions of pthread_mutex_t / pthread_cond_t -- Not to mention, you'd expect that a one-piece semaphore can be implemented more efficiently than a condvar/mutex pair, at least... in theory[^more]. At least, unless the unified semaphore doesn't have to uphold extra guarantees (as it does for POSIX's sem_t, I believe).

  • Hermit (and perhaps only hermit but perhaps others) only has a semaphore, and we implement other primitives on top of them. I don't have a vested interest in hermit, but I would really like any excuse to remove https://github.com/rust-lang/rust/blob/master/library/std/src/sys/hermit/condvar.rs which look like it's implementing the same algorithm it cites, and seems like it could be buggy[^hermit].

That said, a lot of this probably should be in the a tail of followups, and only if it's worthwhile.

[^more]: That is, at worst the single-piece semaphore could just be implemented in terms of a private condvar and mutex pair, resulting in the same performance. That said, relying too heavily on this style of logic is a good way to get unpleasant surprises, so who knows.

[^stores]: I mean, if nothing else they start with __ and don't appear in a single public header (they only appear in the headers inside apple source repos). So, even though these APIs don't seem to be in the blacklists that prevents you from even attempting app store submission (which is not a guarantee), it seems completely out of the question to use in libstd (even if I do think they're cool, and I do).

[^dl]: Calls to dlopen/dlsym tend to be disallowed in the same situations, since tooling can't tell if you're using it to use private APIs. (We already do this all over libstd though, so maybe it's not the issue it once was)

[^abi]: Which isn't API-compatible, but presumably has to stay around for ABI reasons. I mean, it must, right? C++ standard libraries are famously resilient to ABI issues, after all 😅...

[^hermit]: For example, the load-then-sub in notify_one is seems like it has a race that could lead to a deadlock, but there may be a reason this is impossible -- I've never spent that long thinking about it, just seems dubious, and like the kinda code that'd be real nice to replace with something that (if nothing else) will get more attention.

thomcc avatar Feb 08 '22 05:02 thomcc

On both Linux and some BSDs there's a native futex() syscall, which provides functionality similar (but more primitive) than a parking lot. Nearly equivalent functionality is available in Wasm. While not trivial, it's possible to implement our synchronization primitives directly based on such a futex. This makes all the primitives small (32-bit), efficient, and const constructible.

Please make sure you implement a fair or eventually fair (this is what parking lot does and needs the hashmap for) lock to prevent starvation.

bjorn3 avatar Feb 09 '22 13:02 bjorn3

I don't think pthread mutexes are fair so the new implementation shouldn't be required to be fair either.

nbdd0121 avatar Feb 09 '22 14:02 nbdd0121

SRWLOCK is also unfair too, pthread_mutex_t depends (POSIX doesn't require it, apple has it, I believe glibc doesn't)... So this would mean we need to a parking lot everywhere, even if the OS has otherwise good locks (like with Window's SRWLOCK).

More generally: There are a bunch of locking features that would be nice to have in an ideal world, but I think they start becoming contradictory, or only achievable easily on one platform or another -- for example, I believe the parking-lot style of lock won't respect OS priorities (unless there's a clever trick I don't know about or something). Trying to solve all of this probably has no solution, and I'm not sure that we're in the right position decide which things we want to promise yet[^1]... assuming we even want to make these sorts of promises (and I'm not sure we do).

[^1]: I feel like the place for that is afterwards we ship it, if/when bugs come in from users and such. Or at least when we're further in the implementation.

thomcc avatar Feb 09 '22 15:02 thomcc

Apple switched the default from PTHREAD_MUTEX_POLICY_FAIRSHARE_NP to the newly introduced PTHREAD_MUTEX_POLICY_FIRSTFIT_NP in macOS 10.14 (iOS and tvOS 12.0, watchOS 5.0), so they are not fair for recent macOS either (by default).

hkratz avatar Feb 09 '22 16:02 hkratz

Wouldn't implementing lock algorithms on top of futex et al. effectively mean that both the parking_lot based implementations and these other algorithms receive less test/use coverage overall? Or is there some plan to share the gnarly implementation details between parking_lot based implementations and the futex/etc. ones?

nagisa avatar Feb 12 '22 23:02 nagisa

@nagisa Yes. That's an issue with all platform-specific code, I suppose. But we do plan to also have the CI test the generic implementation (parking lot) on Linux as well, so it gets test coverage.

m-ou-se avatar Feb 16 '22 10:02 m-ou-se

I'm now reading through the various popular lock implementation to see how those are implemented.

Starting with musl.

Mutexes

Musl's pthread_mutex_t is 40 bytes (on 64-bit), although for regular/normal mutexes, only two 32-bit integers are relevant/used: the lock futex and the waiters counter. (Pthread mutexes support various attributes and different types which are all runtime settings, and thus stored inside the mutex itself. Depending on the type, other fields in the mutex are used. In Rust, however, different types are tracked in the type system, not dynamically inside the mutex.)

Its implementation is relatively simple. The lock integer contains just two bits: one to indicate whether the mutex is locked, and one to indicate whether it's contended. The contended bit can only be set when the lock bit it set as well. Locking tries to compare-and-swap 0 to 'locked' (spinning 100 times if necessary), and otherwise blocks on a 'futex wait' on the lock. Before waiting on the futex, the high bit of the lock is set and the waiter count is incremented. After waking (whether spurious or not) the waiter count is decremented.

When unlocking, a 0 is unconditionally written to the lock, and when the high bit was set or the waiter count was nonzero, a one thread is awoken through the futex.

It's not entirely clear to me how useful the waiter count is. This mechanism would also work correctly with only the 'contended'-bit. The waiter count is only used by the spin loop while locking, to avoid spinning and sleep directly when there's other waiters, to avoid stealing the lock in front of others. This makes things a little bit more 'fair'. See commit f5fb20b.

Condition variables

Musl's pthread_cond_t is 48 bytes (on 64-bit), although for 'normal' condition variables, only two pointers and a 32-bit lock/futex is used.

The lock protects a doubly linked list pointed at by the two pointers. This doubly linked list is the waiting queue, where each node is stored on the stack of a waiting thread. Each node contains a state (waiting, signalled, or leaving) and a futex that the thread is waiting on. pthread_cond_signal wakes up the first thread on the list that wasn't signalled yet. pthread_cond_broadcast marks all nodes as 'signalled', but only wakes up one thread. That thread, after woken up (and locking the mutex), will requeue the next signalled thread onto the mutex. That way, there's no need to store a pointer to the mutex in the condition variable, as pthread_cond_broadcast doesn't have to know about the mutex. It only requeues one thread. When that thread is woken up, that one will requeue the next one, and so on.

Reader-writer locks

Musl's pthread_rwlock_t is 56 bytes (on 64-bit), although it only uses two 32-bit integers in the normal case: one lock futex, and one waiters count. Its implementation is very similar to pthread_mutex_t, except 31 bits of the lock futex are used to count the number of active reader locks. The maximum value is used to indicate a writer lock. The other bit is used as the 'contended' bit, and is used for both readers and writers. It does not prioritize writers, allowing more readers to lock the lock even if there's writers waiting. The reason for that is explained in commit 50304f2: posix requires reader locks to be recursive. Proritizing writers could make a second reader lock on the same thread block.

Read-unlocking a contended lock will wake up one thread. Write-unlocking a contended lock will wake up all threads. The latter could cause another writer thread and many reader threads to wake up at the same time, potentially resulting in the writer thread winning the race and all the reader threads going back to sleep again.

The waiter counter has the same purpose as in pthread_mutex_t above.


In general it's interesting to see how many implementation details are relevant to Posix, but not to us here in Rust. For example, commit c68de0b mentions how Posix allows destroying and unmapping a mutex even if another thread hasn't returned from pthread_mutex_unlock yet. In Rust, this is impossible due to borrow rules. And on top of things like that, Posix supports more types of locks (recursive, process-shared, priority-inheriting, ..) all within the same type, which we do not need to support (within the same type).


Tl;dr: Musl uses trivial rwlocks and mutexes, except it separately counts waiters to avoid spinning when there's other waiters. Rwlocks don't prioritize writers because Posix wants reentrant reader locks. Condition variables are not trivial; they keep a waiter queue as linked list in user space, and requeue threads one by one.

m-ou-se avatar Feb 16 '22 11:02 m-ou-se

Next: glibc's mutexes. (Glibc's condition variables and rwlocks are a bit more complicated. I'll come back to those later.)

Mutexes

Glibc's mutexes are 40 bytes and support a lot of different properties. The 'normal' kind, (PTHREAD_MUTEX_TIMED_NP) however is very simple, and only uses three 32-bit fields: the lock futex, the owner thread id, and the number of users. The last two fields are useful for debugging and dead-lock/misuse detection, but not used for the basic functionality of the mutex. (The n_users field represents the current number of threads that hold the lock (0 or 1), except it doesn't get decremented when pthread_cond_wait temporarily unlocks the mutex, which means that those waiting threads are also included in this number.)

The normal mutex, unlike the many other variants, is very simple. It has three states. 0 for unlocked, 1 for locked without waiters, and 2 for locked with waiters. Locking tries to compare-and-swap 0 to 1, or otherwise swaps in a 2 and futex-wait's for things to change. Unlocking swaps a 0 in place, and futex-wake's one waiting thread if it was set to 2. Unlike musl's implementation, it does not try to spin at any point, and directly switches to sleeping if the compare-and-swap fails.

It's basically identical to the one I wrote in my experiment: https://github.com/m-ou-se/futex-lock-experiment/blob/ab0e9a135aff5f3b2e01218a3281cbc552d76653/src/lib.rs#L30-L109

m-ou-se avatar Feb 16 '22 14:02 m-ou-se

  • Has anyone looked into the condition variable darwin libpthread? Works with a futex-based api (ulock), uses 64bits of space, doesn't do extra wakes or require requeue, and is simpler than glibc's condition variable implementation.

  • Would os_unfair_lock be possible for darwin systems? It supports priority inheritance, is 32bits large and movable, but is only available on macos 10.12+

  • For non-futex/windows platforms, a thread parker + parking_lot's WordLock could be used for the Mutex. There's also windows' CONDITION_VARIABLE and SRWLOCK source to look into and some inspection suggests the latter is very similar to WordLock (which allows it to be a usize in space) but with more complications.

kprotty avatar Feb 17 '22 20:02 kprotty

Would os_unfair_lock be possible for darwin systems? It supports priority inheritance, is 32bits large and movable, but is only available on macos 10.12+

Lack of timed wait is a problem, and libstd supports back to 10.7+ (and I'm not sure what's required to reconsider this).

thomcc avatar Feb 17 '22 20:02 thomcc

Ah, fair point for the 10.7+ requirement. Does the new Mutex impl need timed wait? SRWLock and the current API don't support it.

kprotty avatar Feb 17 '22 20:02 kprotty

It's supported via condvar: https://doc.rust-lang.org/std/sync/struct.Condvar.html

thomcc avatar Feb 17 '22 20:02 thomcc

Condvar's wait_timeout functions look like they always return the MutexGuard (excluding when poisoned). This implies that the mutex itself must be locked on return, hence no need for Mutex to internally have a timed_lock(). darwin's and musl's condvar implementations help confirm.

kprotty avatar Feb 17 '22 21:02 kprotty

Continuing with glibc's condition variables:

Condition variables

Glibc's condition variables are 48 bytes, and all of them are used in normal operation. Unlike musl's implementation, this one does not keep a queue in user space, as the same implementation is used for process-shared condition variables, which cannot refer to any memory addresses that might be different in different processes.

It keeps a 64-bit sequence number that's incremented by every new waiter, and a second 64-bit number that indicates which of those have been signalled. Within the structure, there are futexes and some other fields for exactly two 'groups' of waiters: G1 and G2. Waiters add themselves to G2, and signals wake up threads in G1. When G1 is empty, signalling first swaps the two groups. This is a workaround for bug 13165, which is about time-traveling signals that end up waking future waiters rather than current waiters.

Two bits of the condition variable represent its internal lock (a trivial three state mutex), which protects the state of the condition variable.

There is some additional complexity about destroying a condition variable while there are still waiters. However, this is irrelevant for our Rust implementation, since we don't allow dropping things that are still in use.

This implementation raises the following questions for a Rust implementation:

  • Do we need a 64-bit counter? A futex (on Linux and BSD, today) is 32-bit, so having 64-bit counters makes things more complicated. A 32-bit counter could potentially overflow all the way until it reaches the same value while unlocking the mutex in condvar.wait(), although that seems extremely unlikely.
  • Does a trivial (single futex) condition variable implementation suffer from the same 'time traveling signal' bug? If so, do we consider that a bug as well in Rust?

m-ou-se avatar Feb 23 '22 15:02 m-ou-se

More on glibc's condition variables:

Their old design before the bugfix for the time-traveling signals is documented here in pseudocode: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/DESIGN-condvar.txt;h=4845251c7586c9667cd33c38af3701910d65eaa8;hb=c0ff3befa9861171498dd29666d32899e9d8145b

It keeps a (64-bit) counter for the number of signals that have been sent, and another for the number of threads that have been woken up. The difference between those two is effectively the number of signals ('tokens') that can still be consumed. If a waiting thread wakes up from the futex-wait operation and there are no tokens left (i.e. the counters are equal), it goes back to sleep. That's how a future thread can consume a token and prevent another waiting thread from waiting up from pthread_cond_wait.

m-ou-se avatar Feb 24 '22 12:02 m-ou-se

llvm libc for comparison: https://github.com/llvm/llvm-project/blob/main/libc/src/threads/linux/CndVar.h

tschuett avatar Feb 25 '22 17:02 tschuett

Finishing up glibc:

Reader-writer locks

Glibc's rwlock is documented here. It is 32 bytes, of which 22 are used during normal operation. It contains two futexes, a readers counter, a writers counter, the thread ID of the current writer, the selected mode, and a flag indicating whether it is process-shared or not. It supports three different modes: a) prefer readers (default), b) prefer writers with recursive reader locks, c) prefer writers without recursive reader locks. Only in the last mode do waiting writers block additional reader locks.

Part of its complexity is due to supporting these different modes within the same structure. Another part of the complexity is due to posix allowing destruction in cases where the Rust borrow checker would not allow it. Both of these are not relevant to our implementation in Rust.

The reader counter also contains three state bits. Using it directly as a futex is avoided, in part because that number can rapidly change as readers come and go. Therefore, one of the state bits (whether the lock is in reader or writer mode) is duplicated in a separate futex.

It also includes a mechanism through which writers can hand over a writer lock to each other without unlocking, which is only used if the lock is set to a mode that prefers writers.

It has nine states, each of which is either part of the 'read phase' (1 through 4a) or the 'write phase' (5 through 8). Even the idle state is split in both a 'read phase' (1) and a 'write phase' (5) idle state. It's entirely symmetrical, except for one more state (4a) in the read phase which is used in the non-recursive writer-preferred mode to block new waiters even when the lock is already read-locked. The transition from one of the idle states to the locked state in the other phase (2 or 7) goes through a intermediary state (3 or 6), during which a thread trying to lock the lock in the 'current phase' (read/write) can prevent the phase change and steal the lock. (This intermediary state is not used for the 'try lock' functions.) The locked states (2 and 7) are split into two states where one indicates there are waiters of the opposite phase (4/4a and 8).

Below is a state transition diagram of this lock. The dashed arrows are only used if writers are preferred. The thin arrows are only used by the 'try lock' functions.

glibc-rwlock-states drawio

The thread ID of the current writer that's stored inside the lock is only used for debugging and deadlock detection when the holder of a write lock tries to acquire a read lock.

m-ou-se avatar Mar 01 '22 11:03 m-ou-se

Next up: Wine

We don't have the source code to Microsoft's SRW locks and condition variables, but we do have the source code to Wine, which implements these as well. However, while they have to be the same size as Microsoft's implementation, they are implemented very differently and you'll see very different bit patterns in them if you inspect them while they are in use.

Reader-writer locks (and Mutexes)

On Windows we use SRW locks for both RwLock and Mutex, so there's no separate mutex implementation we're looking at. They are the size of a pointer, but Wine's implementation only uses 32 bits, regardless of pointer size. It is split into two 16-bit counters: The number of active reader locks ('owners'), and the number of waiting writers. The reader/owner counter is set to u16::MAX when it is write-locked.

The lock and unlock functions are very simple. Write-locking increments the waiter counter (using a 16-bit operation), and then uses a 32-bit CAS operation on both counters at once to acquire the lock if the owner counter is zero by setting it to u16::MAX and decrementing the waiter counter again. If the owner counter wasn't zero, it'll do a futex[^1] wait to wait until it is.

[^1]: They are not called 'futexes' on Windows, but the operations are identical: WaitOnAddress, WakeByAddressSingle, and WakeByAddressAll.

Read-locking uses a CAS operation to increment the number of owners only if there's no waiting writers and the lock isn't write-locked. So, waiting writers block new readers. If the conditions aren't met, it futex-wait's until they are.

Interestingly, two different addresses are used within the SRW lock as two different futexes. Readers wait on the address of the SRW lock itself and use the full 32-bit value, while writers wait only on the second half which contains the 16-bit owner counter. Unlocking the last reader lock will wake a single thread from the writer queue. Unlocking a writer lock will wake a single writer if there was any waiting (indicated by the waiting counter), or wake all readers if there are no waiting writers.

Mixing 16-bit and 32-bit atomic operations on overlapping memory is not common, and the Intel manual (vol. 3A) seems to suggest this is not a great idea:

Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths.

Also, surprisingly, no overflow checks are done. If either of the counter overflows, unexpected things happen.

Condition variables

Wine's condition variables are trivial. Just like SRW locks, they are the size of a pointer, but Wine's implementation only uses 32-bits. It is simply a 32-bit counter that gets incremented on every notification. It does not keep track of which lock it is used together with, nor does it make use of requeueing. It does not keep track of any information to prevent a futex-wake operation when there's no thread waiting, etc. It is identical to Condvar1 in my experiment.

m-ou-se avatar Mar 01 '22 17:03 m-ou-se

Next up: Windows

Now this one is a bit tricky because the source code isn't available, but I made some guesses and have good reason to believe they are accurate.

Reader-writer locks (and Mutexes)

On Windows we use SRW locks for both RwLock and Mutex, so there's no separate mutex implementation we're looking at. SRW locks are one pointer in size, of which the last four bits encode four flags. The other bits either represent a (60-bit or 28-bit) counter, or they represent a (64-bit or 32-bit) pointer with the last four bits missing. That pointer points to something 16-byte aligned, to make sure the last bits are always 0.

When there's no contention, the number of active read locks is stored in the counter. For a write lock, the counter is 0. The least significant bit is used to indicate whether the lock is locked (regardless of read or write mode). An unlocked lock is entirely zero.

When there are threads waiting, that is indicated by one of the four flag bits, and the counter is no longer used and instead used as a pointer to a doubly linked list of waiting threads. The nodes of that list are stored on the stack of the waiting threads. Each node also contains a pointer to the last node in the queue, so you can quickly get there from the first pointer. (The lock only stores a pointer to the first node.)

The last node contains the number of active read locks, since the pointer took the place of that counter within the lock itself.

Each node contains the thread id of the waiting thread, and an additional flag that indicates whether the thread is waiting to acquire a read or write lock.

The thread id is used with the undocumented NtWaitForAlertByThreadId and NtAlertThreadByThreadId APIs. These seem to be effectively identical to Rust's thread::park() and Thread::unpark() functions. So, SRW locks are not futex-based, but thread parker based.

I'm assuming some of the flag bits in the lock are used as a mutex to protect modifications to the queue.

A waiting writer on a read-locked lock prevent new readers from acquiring the lock, to avoid writer starvation, but it doesn't exactly prefer writers in general: it ~roughly attempts to handle the requests in order. SRW locks are not fair.

Like most lock implementations, it first spins a while before going to sleep, to avoid expensive syscalls when the lock is only locked for a very short time. Spinning seems configurable through some global. It also seems to support an alternative to spinning based on the monitorx and mwaitx instructions.

Condition variables

Condition variables on Windows are also a single pointer. They work very similar to SRW locks, storing a pointer with flag bits in the least significant bits, with the pointer pointing to a linked list over waiting threads. The nodes even have the same structure as the nodes used by SRW locks.

WakeAllConditionVariable iterates over the entire list and wakes all threads. It does not requeue waiters to the relevant lock.


Here's a diagram showing an SRW lock in different states:

srw-locks drawio

A condition variable looks nearly identical, except for the locked states and the four flag bits.


tl;dr: SRW locks and condition variables on Windows are thread parker based and keep their waiting queue as a linked list in user space through nodes on the stack of the waiting threads. Thread parking is natively supported on Windows through an undocumented API. It prefers writers to avoid writer starvation. They spin before sleeping, and can use the special monitorx and mwaitx instructions as an optimization.

m-ou-se avatar Mar 10 '22 14:03 m-ou-se

And then we have: boost (on Posix)

Mutexes

Boost's mutexes (boost::mutex) on Posix platforms are simply a wrapper around pthread_mutex_t. (So, using glibc, these are 40 bytes.) (Implementation here.)

Condition variables

Boost's condition variables (boost::condition) on Posix platforms are a wrapper around pthread_cond_t, but with an additional pthread_mutex_t. (So, using glibc, these are 88 bytes.) The mutex is used to allow for interruptions/cancellation. (Implementation here.)

Waiting on a condition variable is an interruption point. However, if waiting would result in pthread_cond_wait on the mutex that the user wants to wait on, boost would not be able to interrupt that when that mutex is still locked, since pthread_cond_wait will only return after locking that mutex. So, instead, boost's condition variable waits not on the mutex the user provided, but on its own internal mutex. To avoid race conditions, that mutex is locked on every operation, including when sending notifications.

Reader-writer locks

Boost's reader-writer locks (boost::shared_mutex) store 8-byte state protected by a mutex, together with three condition variables. (So, using glibc, these are absolutely huge: 312 bytes, spanning five cache lines.) (Implementation here.)

The state contains a 32-bit reader lock counter, and three boolean flags: writer locked, writer waiting, and upgrade. (The last one is only used for upgrading locks (read lock → writer lock).)

One condition variable is used for notifying writers, one for notifying readers, and one is used for upgrading locks.

Every operation locks the internal mutex before it inspects or changes the state. If the state doesn't allow for the thread to continue yet, it waits for it to change using the right condition variable.

Unlocking a writer lock or unlocking a the last reader lock results in notifying both the reader and writer condition variable. One writer and all readers are woken up.

This implementation isn't terribly efficient, but it's easy to prove correct as it's all protected by a single mutex.

These locks prefer writers. A waiting writer on a read-locked lock prevent new readers from acquiring the lock, to avoid writer starvation.

(There is also a 'v2' implementation which is similar but slightly smaller. (224 bytes on glibc.) It does not support upgrading and only uses two condition variables, and it only notifies one condition variable (either the readers or the writers, not both) on unlock.)


tl;dr: Boost on Posix platforms uses only standard pthread locks, and does not directly use any platform-specific things like futexes. The mutex is just whatever the underlying libc provides (see glibc and musl above). The condition variable is mostly a wrapper around libc's version, except for an additional mutex for a trick to provide thread cancellation. But its read-writer locks are not just pthread_rwlock_t, and instead implemented on top of a mutex and three condition variables, making them very big. It prefers writers to avoid writer starvation.

m-ou-se avatar Mar 10 '22 16:03 m-ou-se

Futexes on Linux seems like a good plan. Remember to read "Futexes Are Tricky" if we're going down that road: https://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf This is basically an explanation of glibc's mutex implementation, as I understand it.

I would highly recommend just copying an existing mature C/C++ implementation of concurrency-primitives-built-on-futexes instead of trying to invent something new.

Do we need a 64-bit counter? A futex (on Linux and BSD, today) is 32-bit, so having 64-bit counters makes things more complicated. A 32-bit counter could potentially overflow all the way until it reaches the same value while unlocking the mutex in condvar.wait(), although that seems extremely unlikely.

It's not so much accidental overflow that is the concern here, but rather attacker-controlled overflow causing memory unsafety. This stuff can be surprisingly attacker-controllable if exploited in the right way.

pcwalton avatar Mar 11 '22 21:03 pcwalton

Futexes on Linux seems like a good plan. Remember to read "Futexes Are Tricky" if we're going down that road: https://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf This is basically an explanation of glibc's mutex implementation, as I understand it.

That's an old version of that paper. A newer version is available here.

It explains the design of basically the most simple futex-based mutex by taking a bit of a detour through an obviously incorrect mutex first. (And a second implementation based on the assumption that Atomic*::swap does not exist.) The final simple mutex implementation it describes is indeed what glibc's mutex comes down to in practice. As mentioned above, it is identical to the mutex I implemented in my experiment before I started investigating the various existing implementations: https://github.com/m-ou-se/futex-lock-experiment/blob/ab0e9a135aff5f3b2e01218a3281cbc552d76653/src/lib.rs#L30-L109

The paper does not describe condition variables or read writer locks, which are far more interesting.

I would highly recommend just copying an existing mature C/C++ implementation of concurrency-primitives-built-on-futexes instead of trying to invent something new.

We haven't invented anything new here. My 'newly invented' experimental mutex implementation turns out to be identical to the one in glibc and the one in the paper you mentioned. My 'newly invented' experimental condition variable turns out to be identical to the one in GLib and Wine. (Not that these will necessarily be the ones we continue with.)

Of course we should be and will be very careful with going with anything that differs from what we already see in other implementations. The goal is not to 'invent something new', we are exploring what existing implementations do and learning which of their quirks and complexity is and isn't relevant for us, to be able to make an informed decision on what to copy and what not to copy.

My conclusion from all the C and C++ implementations I've investigated so far, is that a lot of their complexity comes from requirements that we do not have in Rust, such as allowing dropping while still in use, detecting misuse that Rust already statically detects through the borrow checker, runtime configuration such as whether to allow re-entrance or whether to prefer readers or writers, allowing thread interruption/cancellation, etc. etc. Just copying an existing implementation could mean we end up with a very complex Condvar that will be extremely hard to review and maintain, or an absolutely huge RwLock of 312 bytes that could've been 4 bytes.

It's not so much accidental overflow that is the concern here, but rather attacker-controlled overflow causing memory unsafety.

You cannot cause memory unsafety with such an overflow. The Condvar implementation in my experiment doesn't even use any unsafe code at all. Missing exactly 4294967296 notifications while unlocking the mutex in Condvar::wait would only cause the thread to go to sleep when it should not. While this may in some situations result in denial of service, that's not really an issue in the situation where an attacker has already gained enough control to completely block the thread in the exact right moment for long enough to make it miss over four billion notifications.

m-ou-se avatar Mar 13 '22 17:03 m-ou-se

GLib has a quite simple futex mutex/cond implementation without any bells and whistles, if you want to take a look at another one. It's probably not very different from the one you have though, there's only so many things you can possibly do differently.

sdroege avatar Mar 13 '22 18:03 sdroege

GLib has a quite simple futex mutex/cond implementation without any bells and whistles, if you want to take a look at another one. It's probably not very different from the one you have though, there's only so many things you can possibly do differently.

Thanks. It's indeed basically identical, except it uses an atomic add rather than a compare exchange when locking the mutex, although it does overwrite it with a 2 right afterwards if it is incremented higher than 1. This means that the state of the mutex can temporarily get higher than 2 and has a slightly higher chance of unnecessary EAGAIN returning futex wait calls. But (at least on x86) results in slightly simpler machine code. It also means it's theoretically possible to overflow the mutex causing memory unsafety.

m-ou-se avatar Mar 13 '22 18:03 m-ou-se

Would also like to recommend semaphore based synchronization primitives if it hasn't been considered already, and if simplicity (but not necessarily throughput) is a top priority.

Related, but for a performant, futex-based, RwLock there's also this algorithm.

kprotty avatar Mar 13 '22 19:03 kprotty

Here is a quick analysis of Apple Darwin's pthread implementation:

I'm not going into as much detail as the other implementations. MacOS has native (undocumented) syscalls for things like mutexes and rwlocks (psynch), but also supports a not-fully-documented futex-like syscall (ulock_wait). By default, the pthread locking types use their psynch based implementation, but there is also an alternative ulock 'futex based' implementation.

All of pthread_mutex_t, pthread_cond_t and pthread_rwlock_t in this implementation contain fields which need a higher alignment than the alignment of the struct itself. This issue is worked around by adding padding after the fields, and dynamically calculating the offset of the field. This means that this now seems to be the first implementation I've seen where the types are truly not movable. Moving one of these to a differently aligned address would shift around those 'overaligned' fields.

pthread_mutex_t is 64 bytes, pthread_cond_t is 48 bytes, and pthread_rwlock_t is 20 bytes.

Mutex

The mutex locking functions do not seem to spin in user space before calling into the underlying layer to wait for the mutex. They do atomic compare-exchange operations to avoid these calls on an uncontended lock, but do not spin for some time to wait for the lock to be released.

Similar to all other implementations, it also avoids the wake/unlock syscall on unlock when there is no contention.

Condition variables

Neither of the implementations of condition variables seem to requeue waiters onto the queue of the mutex, but instead wake up all threads at once. I can't see what __psynch_cvbroad ends up doing in the kernel, but it does not get a pointer to the mutex, so requeueing seems out of the question.

The struct does store a pointer to the mutex used by the wait operation, but it is only used for checking correct usage, and cleaning up when a thread gets cancelled.

Futex-based Condition variable

The most important part of the futex-based condition variable is the 64-bit state, containing a 32-bit sequence/notification counter, a 16-bit waiter counter, and a 16-bit counter for unconsumed signals. From a brief inspection of the code, it seems that it suffers from the same issue as the trivial condition variable: if it misses exactly 4294967296 notifications while unlocking the mutex, it can go to sleep when it shouldn't. That's nearly impossible in practice though.

The two 16-bit counters make it possible for pthread_cond_signal and pthread_cond_broadcast to skip the wakeup syscall in case no thread is waiting, and also in case all waiting threads have already been signalled but are still waking up.

Reader-writer lock

The default reader-writer lock prefers writers in the sense that new readers on a read-locked lock are blocked if there are any writers waiting. However, those new readers do seem to get priority over any writers that come later. That is, all readers and writers seem to be part of the same queue.

m-ou-se avatar Mar 16 '22 15:03 m-ou-se