Add WaitGroup synchronization primitive
This is more efficient than creating a `Channel(Nil)` and looping to receive N messages: we don't need a queue, only a counter, and we can avoid spurious wake-ups of the main fiber, resuming it only once.
See the documentation for examples and more details.
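For reference, here is a minimal usage sketch (assuming the `add`/`done`/`wait` API proposed in this PR, with `require "wait_group"` as the entry point):

```crystal
require "wait_group"

wg = WaitGroup.new
results = [] of Int32

5.times do |i|
  wg.add # increment the counter once per pending fiber
  spawn do
    results << i * i
    wg.done # decrement the counter; the last decrement resumes the waiter
  end
end

wg.wait # suspend the main fiber once, until the counter drops to zero
puts results.sum # => 30
```

The channel-based equivalent would have each fiber send to a `Channel(Nil)` and the main fiber loop `5.times { channel.receive }`, waking the main fiber once per message.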
Note to self: this (or rather its MT equivalent) is also known as a latch in C++ and Java
As food for thought: https://cs.opensource.google/go/go/+/refs/tags/go1.21.5:src/sync/waitgroup.go. Go seems to merge the lock and counter into a single atomic field, which makes sense considering that a spin lock is really just another atomic counter, and you probably won't need the full bit width for either of them. This removes the need to keep track of waiting fibers: they just stay blocked on the lock 🤯 Maybe we could explore this in a future refactor?
Go merges the counter (i32) and the number of waiters (u32) into a single atomic (u64) which is a neat idea, then uses a semaphore to suspend the goroutines.
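A rough sketch of that packing, hypothetical and not part of this PR (it assumes a 64-bit target where `Atomic(UInt64)` is available): the counter lives in the high 32 bits and the waiter count in the low 32 bits, so a single fetch-add updates the counter and returns both halves at once:

```crystal
# Hypothetical sketch of Go's packed state on a 64-bit target:
# high 32 bits = counter, low 32 bits = number of waiters.
state = Atomic(UInt64).new(0_u64)

# wait: register ourselves as a waiter (low half).
state.add(1_u64)

# add(3) on the counter: shift the delta into the high half.
delta = 3
old = state.add(delta.to_u64! << 32)
packed = old &+ (delta.to_u64! << 32) # LLVM atomics return the *old* value

counter = (packed >> 32).to_i32!
waiters = (packed & 0xFFFF_FFFF_u64).to_u32!
puts({counter, waiters}) # => {3, 1}
```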
What I'm curious about is:
- how is the semaphore implemented?
- what about architectures without a 64-bit atomic?
I'm afraid following Go won't be possible, because their implementation takes advantage of atomics returning the new value (e.g. add to counter => returns new counter + waiters), while Crystal relies on LLVM atomics that always return the old value, which... is often pointless :disappointed:
For example, to support `add(-5)` I must do the math twice :facepalm:

```crystal
old_value = @counter.add(n)
new_value = old_value + n
```
~~I enabled the WaitGroup spec for the interpreter (and disabled the stress test because the interpreter is too slow)... and the interpreter crashed on CI:~~ Erratum: I hadn't pushed it yet, so the crash is unrelated to the WaitGroup specs (maybe related to #14252?).
```
FATAL: can't resume a running fiber: #<Fiber:0x7f7b0181baa0: main>
from src/crystal/scheduler.cr:148:7 in 'swapcontext'
from src/compiler/crystal/interpreter/interpreter.cr:354:7 in 'interpret'
from src/indexable.cr:574:11 in '->'
from src/fiber.cr:146:11 in 'run'
from ???
```
This failure consistently appears in https://github.com/crystal-lang/crystal/pull/14122 as well.
And in #14257.
There's a new RFC about the WaitGroup API: https://github.com/crystal-lang/rfcs/pull/3
It presents the discussion of this PR in a more structured manner.
Updated to follow the latest changes in RFC #2: resume waiting fibers when the counter becomes negative, and make `#wait` raise when the counter is negative.
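A small sketch of the resulting behavior under this rule (assuming, per the RFC discussion, that decrementing below zero raises):

```crystal
require "wait_group"

wg = WaitGroup.new(1)
wg.done # counter reaches zero: any waiters are resumed normally

begin
  wg.done # decrementing below zero is invalid use of the waitgroup
  puts "no error"
rescue
  puts "raised: misuse detected" # the negative counter is reported by raising
end
```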
src/none.cr seems to have been added by accident?
At the cost of using a compare_and_set loop, it would be possible to "saturate" counter decrements at 0, providing a guarantee that all future add/done operations on a waitgroup in the 0 state will not change the counter, and raise instead. This increases correctness, but will decrease throughput of a heavily contended waitgroup.
The performance implications depend on whether, in practice, waitgroups will be protecting "large" operations (i.e. the time each thread spends maintaining the waitgroup is not dominant, and each CAS op is very likely to succeed) or "small" operations (i.e. the thread spends most of its time maintaining the waitgroup, and each CAS op is likely to fail). My intuition is that it's the former, and that the performance implications of using a CAS loop would be tiny.
Apologies for bringing this up at a "late" stage in the PR, this didn't occur to me until now.
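For concreteness, the saturating decrement could look something like this; a hypothetical sketch of the suggestion above, not this PR's implementation (`saturating_done` is an illustrative name):

```crystal
# Hypothetical saturating decrement: refuse to take the counter below
# zero, so a waitgroup that reached the 0 state stays at 0 and raises.
def saturating_done(counter : Atomic(Int32)) : Int32
  loop do
    current = counter.get
    raise "WaitGroup misuse: counter is already zero" if current <= 0
    # compare_and_set returns {old_value, success}; retry on contention
    _old, success = counter.compare_and_set(current, current - 1)
    return current - 1 if success
  end
end

counter = Atomic(Int32).new(2)
puts saturating_done(counter) # => 1
puts saturating_done(counter) # => 0
```

With an uncontended waitgroup each CAS succeeds on the first try, so the loop costs one extra load compared to a plain fetch-add.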
@RX14 Interesting. I'm not concerned about performance (it's still lock free so it's fine to me), but:
- the counter starts initialized at zero by default, so `add(n)` wouldn't detect an invalid waitgroup anymore, or we'd require the counter to not be initialized at zero;
- as outlined in the RFC, the counter may reach zero multiple times and still be valid (with MT), as long as we always increment before we decrement and start waiting.
Or am I missing something?
Thanks @RX14. I'm noticing more edge cases. For example, we could reach a negative number (which raises), then a concurrent fiber could increment back to a positive number and fail to raise because the new counter is positive. This also impacts the resumed waiting fibers, which may continue despite the waitgroup being left in an invalid state.
I'll likely add a CAS loop, just not one saturating at zero: I'll keep the negative number to detect the invalid state.
I ironed out some corner cases:
- can't increment from a negative counter anymore (forever invalid state);
- raises on positive counter after resume (confused state).
I'm a bit torn about the last one. The situation can happen when the counter reached zero, enqueued the waiters, then continued to increment, which is invalid. Yet there is a race condition when reusing a WaitGroup with at least 2 waiters: fiber A reuses the WaitGroup (i.e. increments) before fiber B resumes (positive counter -> raise).
Oh, the race condition would also trigger with a negative counter (lower probability but could happen), so the problem is reusing the object before all waiters are properly resumed.
Ah, the joys of writing a synchronization primitive :feelsgood:
My highest concern is deadlocks: any condition where the counter remains at zero but fails to resume a waiter. There are race conditions when the waitgroup is used weirdly, but they don't cause a wrong count or a deadlock, so they can fail in a better way (by raising).
AFAIK it should now be impossible to fail to wake the waiting fibers or leave the waitgroup in a confusing state: the counter saturates to a negative number and can't return to a positive number anymore; waiting fibers are always resumed (once) when the counter reaches zero or below.
I can't think of any scenario where we'd end up with a deadlock.
I can still think of race conditions, though:
- counter reaches zero
- waiting fibers are enqueued
- counter reaches a negative number
Depending on when the fibers are resumed, some may return successfully (zero counter) while some may raise (negative counter), yet at least one fiber will raise (the one decrementing the counter below zero), so the error shouldn't go unnoticed.
I cannot wait to use this over some Channels that I have.
Same! Very excited for this one :tada: