open question about race with kernel that liburing had to fix
I found this commit to liburing some time ago. It may have no bearing on this repo because the atomic code is necessarily different anyway, but I didn't want to lose track of it before I or someone else had made sure.
- the commit: https://github.com/axboe/liburing/commit/744f4156b25d8630fee279cfd852ecc4c73952d4
- the PR with the fix: https://github.com/axboe/liburing/pull/542
- the issue it fixes: https://github.com/axboe/liburing/issues/541
I suspect that using SeqCst is not the most reasonable solution, but I'm willing to follow liburing.
My grain of salt: using SeqCst is a misinterpretation of the kernel documentation:
When using the SQ poll thread (IORING_SETUP_SQPOLL), the application needs to check the SQ flags for IORING_SQ_NEED_WAKEUP after updating the SQ tail; a full memory barrier smp_mb() is needed between.
In the Rust memory model we need to establish the right happens-before relationship between the two operations. It should be sufficient to merely use AcqRel instead, which already prevents atomic reads and writes from crossing such a barrier. Also, SeqCst semantically only affects the ordering with respect to other SeqCst operations, which is dubious since none of that is guaranteed by any of the interfaces. Actually, might a compiler_fence already be sufficient? (Edit: the read from flags needs to be Acquire for that relationship to carry through, of course, as it is implemented.)
After having another think about this and re-re-reading the diagrams in the original issue, I don't think I have properly understood the cross-language atomic semantics yet. That might mean Rust does not have the proper tools to solve this properly. Sorry for the noise.
A newer kernel which I was reading does appear to be using an extra acquire fence. However, it establishes a different order than the one required for AcqRel to be definitely sufficient: https://github.com/torvalds/linux/blob/a93289b830ce783955b22fbe5d1274a464c05acf/io_uring/sqpoll.c#L334-L352
I had a look at the code here and the same bug is present here as in liburing/the kernel. The code here is almost the same as in liburing; the only difference is that liburing is using C/C++11 whereas Rust follows the C++20 memory model, but the differences between the two are not relevant to this bug.
A newer kernel which I was reading does appear to be using an extra acquire fence. However, it establishes a different order than the one required for AcqRel to be definitely sufficient.
The code you link to is using a full memory barrier, like the documentation quoted in your previous comment describes, not an acquire barrier. That smp_mb__after_atomic wasn't there originally; I added it in a patch that I sent at the same time as the patch to liburing. That full barrier is meant to synchronize with a full barrier in the user application (in this case, a SeqCst fence).
As for why SeqCst is needed exactly, that requires diving a little into the C++20 standard and making some assumptions about the kernel's memory model (like assuming that smp_mb is equivalent to a SeqCst fence for our purposes). It's crucial that when the application submits new work to the kernel, it first writes to the tail field and only then reads from the flags field. Without a SeqCst fence, nothing in the C++20/Rust memory model prevents the write from being reordered with the read.
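To make that concrete, here is a minimal sketch of the submit-side pattern; `SqRing`, `sq_tail` and `sq_flags` are illustrative stand-ins for the mmap'd ring fields, not the crate's actual types:

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

// Illustrative stand-ins for the shared SQ ring fields; the real crate
// reaches these through raw pointers into the mmap'd ring memory.
struct SqRing {
    sq_tail: AtomicU32,
    sq_flags: AtomicU32,
}

const IORING_SQ_NEED_WAKEUP: u32 = 1;

fn submit_and_check_wakeup(ring: &SqRing, new_tail: u32) -> bool {
    // Publish the new entries to the kernel by bumping the tail.
    ring.sq_tail.store(new_tail, Ordering::Release);

    // Without this SeqCst fence, nothing in the C++20/Rust model stops the
    // following load from being satisfied "before" the store above, so both
    // sides can miss each other's writes (the store-buffering outcome).
    fence(Ordering::SeqCst);

    // Check whether the SQPOLL thread went to sleep and needs a wakeup.
    ring.sq_flags.load(Ordering::Relaxed) & IORING_SQ_NEED_WAKEUP != 0
}
```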
"making some assumptions" of the equivalence of these two operations is the confusing part (and this comment is the first I see that assumption spelled out as such precisely). Quite obviously Release and AcqRel only synchronize around a write-then-read pair but here we need to synchronize on the inverse. We care about the visibility of events that happened before the read: something similar to a non-existent load(release) order. There's no way to achive these with Acquire and Release semantics in combination.
Under the assumption that SeqCst interacts with smp_mb, the fix is indeed sufficient.
To put it into the semantic model for Rust and visualize happens-before relationships:
    Kernel                        User
K1: store(flags, WAKE)        U1: store(tail)
        |                             |
K2: smp_mb()                  U2: fence(SeqCst)
        |                             |
        V                             V
K3: load(tail)                U3: load(flags)
If K2 precedes U2 in the SeqCst order, then K1 → K2 → U2 → U3 and user space sees the flag; if instead U2 precedes K2, then U1 → U2 → K2 → K3 and the kernel observes the updated tail. Since there is a synchronizing ordering relationship between any pair of SeqCst operations, including fences, one of the two cases must hold, and that provides the necessary correctness guarantee.
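The same case distinction can be checked as a store-buffering litmus test. This is only a model of the diagram (plain statics standing in for the shared ring memory, a SeqCst fence standing in for the kernel's smp_mb), and a single run of course only samples one interleaving; the assert merely encodes the guaranteed outcome:

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};
use std::thread;

static FLAGS: AtomicU32 = AtomicU32::new(0);
static TAIL: AtomicU32 = AtomicU32::new(0);

fn main() {
    // "Kernel" side: K1, K2, K3 from the diagram.
    let kernel = thread::spawn(|| {
        FLAGS.store(1, Ordering::Relaxed); // K1: set NEED_WAKEUP
        fence(Ordering::SeqCst);           // K2: stands in for smp_mb()
        TAIL.load(Ordering::Relaxed)       // K3: look for new entries
    });
    // "User" side: U1, U2, U3 from the diagram.
    let user = thread::spawn(|| {
        TAIL.store(1, Ordering::Relaxed);  // U1: publish the new tail
        fence(Ordering::SeqCst);           // U2: the fence added by the fix
        FLAGS.load(Ordering::Relaxed)      // U3: check NEED_WAKEUP
    });
    let (k3, u3) = (kernel.join().unwrap(), user.join().unwrap());
    // With both SeqCst fences in place the outcome k3 == 0 && u3 == 0 is
    // forbidden: at least one side must observe the other's store.
    assert!(k3 == 1 || u3 == 1);
}
```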
As an aside to vindicate AcqRel a little bit, replacing the load with flag.fetch_xor(0, AcqRel) is also sufficient for similar reasons, since the writes then generate one of the two happens-before sequences through the operations on the memory location of flags. But that actual write pessimizes codegen quite a bit, specifically on weakly ordered architectures such as arm (amusingly, on x86 LLVM generates precisely the same mfence; mov sequence).
It's surprising that a SeqCst fence is sufficient here, but it is. That kind of non-deterministic correctness guarantee is quite unexpected on its own, no? (Do you happen to know of any good CS paper on the transfer of invariants in this manner? The responsibility of waking might as well be some other affine state. It's fascinating.)
As per the section you quoted: there is an ordering relationship between two atomic writes to the same memory location ("in the modification order of M") in any case. The problem with applying it here is that, in the code as written, only one atomic write occurs on each of the two separate locations, so that portion can't be used for reasoning. This is what fetch_xor would solve: it introduces a second write to flags, which allows a case distinction over the order of the two writes, just like a pair of SeqCst fences. If the kernel's RMW K1 reads the modification made by the RMW U3, then there is a Release-Acquire pair through M = flags, and it brings the U1 → U3 chain with it, so the kernel will also read the tail update. If, on the other hand, the other modification order occurs, then the atomic xor already reads the wake bit.
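Spelled out as code, the alternative under discussion is just this (a sketch only; as discussed above and below, its correctness additionally assumes the kernel's flag update is an RMW that participates in the modification order of flags, and the extra write hurts codegen on weakly ordered targets):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const IORING_SQ_NEED_WAKEUP: u32 = 1; // illustrative constant

// The xor with 0 leaves the value unchanged, but it is still an atomic
// read-modify-write, so it adds a second write to the modification order
// of the flags word and enables the case distinction described above.
fn need_wakeup_via_rmw(flags: &AtomicU32) -> bool {
    flags.fetch_xor(0, Ordering::AcqRel) & IORING_SQ_NEED_WAKEUP != 0
}
```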
I had a look at the code here and the same bug is present here as in liburing/the kernel.
Is there a fix for tokio-rs/io-uring that you think presents itself, or are you saying the API, as it stands now, should come with an unsafe method and a comment to the effect that the application must perform the SeqCst fence between their last tail field writes and their next flags field read?
I've been involved in other work since about the time this issue was raised in liburing, so I'm not in a position to test; plus I wouldn't know how to test for the correctness of a memory issue like this. I wonder if anyone has been bitten by this (momentarily) but would not have noticed, because if their flags read happened prematurely, they would have another read happening later on anyway?
Is there a fix for tokio-rs/io-uring that you think presents itself, or are you saying the API, as it stands now, should come with an unsafe method and a comment to the effect that the application must perform the SeqCst fence between their last tail field writes and their next flags field read?
I think you can apply the same fix here that was used in liburing:
in submit[_...], if SQPOLL is enabled, insert a SeqCst fence before calling sq_need_wakeup. Inside sq_need_wakeup itself, the load can be replaced with a relaxed load since the acquire is not necessary.
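A self-contained sketch of that shape, with a stripped-down, hypothetical stand-in for the crate's Submitter and illustrative constants (the real code lives in src/submit.rs and differs in details):

```rust
use std::io;
use std::sync::atomic::{fence, AtomicU32, Ordering};

const IORING_SETUP_SQPOLL: u32 = 1 << 1;
const IORING_SQ_NEED_WAKEUP: u32 = 1 << 0;

// Hypothetical, stripped-down stand-in for the crate's Submitter; only the
// ordering of the operations is the point here.
struct Submitter<'a> {
    setup_flags: u32,
    sq_flags: &'a AtomicU32,
}

impl Submitter<'_> {
    // Relaxed is enough: the SeqCst fence in submit() provides the required
    // ordering, and nothing is "acquired" through this flag.
    fn sq_need_wakeup(&self) -> bool {
        self.sq_flags.load(Ordering::Relaxed) & IORING_SQ_NEED_WAKEUP != 0
    }

    fn submit(&self) -> io::Result<usize> {
        if self.setup_flags & IORING_SETUP_SQPOLL != 0 {
            // The SQ tail has already been stored by this point. This fence
            // pairs with the kernel's smp_mb() in the SQPOLL thread.
            fence(Ordering::SeqCst);
            if self.sq_need_wakeup() {
                // ... enter the kernel with IORING_ENTER_SQ_WAKEUP ...
            }
        }
        // ... rest of the submit path ...
        Ok(0)
    }
}
```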
As an aside to vindicate AcqRel a little bit, replacing the load with flag.fetch_xor(0, AcqRel) is also sufficient for similar reasons[...]
AcqRel is not sufficient, because acquire and release create happens-before relations through the synchronizes-with relation, and as far as I can see there is no way to create a happens-before relation between the kernel's store to the flags field and the application's load of the flag, even if the load uses flag.fetch_xor(0, AcqRel). On x86, atomic read-modify-write operations always act as full barriers, but that's not the case on all architectures.
The actual reason SeqCst solves the problem is through the coherence-ordered before relation and the constraints on the single total order S of all SeqCst operations:
An atomic operation A on some atomic object M is coherence-ordered before another atomic operation B on M if [...] — A and B are not the same atomic read-modify-write operation, and there exists an atomic modification X of M such that A reads the value stored by X and X precedes B in the modification order of M [...]
There is a single total order S on all memory_order::seq_cst operations, including fences, that satisfies the following constraints. [...] Second, for every pair of atomic operations A and B on an object M, where A is coherence-ordered before B, the following four conditions are required to be satisfied by S [...] — if a memory_order::seq_cst fence X happens before A and B happens before a memory_order::seq_cst fence Y, then X precedes Y in S.
From the above, if neither side sees the other's write (the kernel doesn't see the store to tail and the application doesn't see the store to flags), then each side's read is coherence-ordered before the other's write, which means each side's SeqCst fence must precede the other's in the total order S, which is impossible. This means that at least one side must see the other's write.
Never mind the AcqRel; it's admittedly a stupid solution that shouldn't be necessary, and I've omitted the details on purpose. (Though: edited in above, discuss in another thread if you want¹.) The kernel should just ensure that fence(SeqCst) and smp_mb are compatible; in practice that will be the only sensible thing to do anyway. I don't think any of the rest of your comment disagrees with the explanation of the two concrete possible causal chains ("coherence order" / "happens-before" in the common CS literature) between the annotated operations already provided in the comment above. Thank you for providing the direct quotations of the relevant sections. The proof by contradiction appears to work just as well as the proof by case distinction ("There is a single total order S").
So yes, I've been convinced enough that this library should be using SeqCst when the wake flag must be observed reliably. I don't think the operation without a fence needs to be unsafe, though; there's no safety problem associated with this failure to synchronize, since blocking forever is memory safe as far as Rust is concerned. I agree it shouldn't be the default behavior and deserves expert treatment; providing a separate sq_need_wakeup_with_intermittent_seqcst might work?
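If the explicit-method route were taken, the shape might be something like this (the method name is taken from the suggestion above; everything else here, including the SubmissionQueue stand-in, is hypothetical):

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

const IORING_SQ_NEED_WAKEUP: u32 = 1; // illustrative constant

// Hypothetical stand-in for a submission-queue handle.
struct SubmissionQueue<'a> {
    flags: &'a AtomicU32,
}

impl SubmissionQueue<'_> {
    /// Issues the SeqCst fence required after the caller's last tail store,
    /// then checks whether the SQPOLL thread needs to be woken up.
    fn sq_need_wakeup_with_intermittent_seqcst(&self) -> bool {
        fence(Ordering::SeqCst);
        self.flags.load(Ordering::Relaxed) & IORING_SQ_NEED_WAKEUP != 0
    }
}
```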
Big thanks to you for looking into this again. In the past year, I've gained a new appreciation for just how many crates are building their io_uring features off of this one. The Rust on Linux community really benefits from getting this right.
Can I reopen this just long enough to raise a related question given the work and PR you-all did yesterday?
There are these two reads now: https://github.com/tokio-rs/io-uring/blob/4e52bca1f2c3ee24b2a51f4b5d9f3d8652f26cab/src/submit.rs#L57C5-L72C1
/// Whether the kernel thread has gone to sleep because it waited for too long without
/// submission queue entries.
#[inline]
fn sq_need_wakeup(&self) -> bool {
    unsafe {
        (*self.sq_flags).load(atomic::Ordering::Relaxed) & sys::IORING_SQ_NEED_WAKEUP != 0
    }
}

/// CQ ring is overflown
fn sq_cq_overflow(&self) -> bool {
    unsafe {
        (*self.sq_flags).load(atomic::Ordering::Acquire) & sys::IORING_SQ_CQ_OVERFLOW != 0
    }
}
Without a comment explaining why, it is glaring that they perform the loads with different atomic ordering.
It's fair to say the three of you know this stuff far better than I do, so I'm only in a position to ask the question: is the second one Acquire rather than Relaxed because no fence is required for it?
These two functions are not public, so their documentation or comments don't have to say much, but perhaps, while this context is still fresh for you, one of you could suggest a comment for one or the other that says why their load instructions are different?
The load in sq_need_wakeup is relaxed because we are not acquiring anything and the kernel's setting of the NEED_WAKEUP flag is not done using a release operation. The SeqCst fence that precedes the call to sq_need_wakeup gives us all the synchronization we need since the only constraint on that code is that if we submitted new work right as the kernel's SQPOLL thread went to sleep, we need to be able to see that and wake it up.
I believe the load in sq_cq_overflow can also be relaxed since again we are not acquiring anything and the kernel doesn't set that flag with a release store. That flag is used to indicate that the CQ overflowed and in order to get the extra CQEs we need to enter the kernel with GETEVENTS. This will cause the kernel to flush the overflowed entries to the CQ after which it will perform a release store to the CQ's tail. When we later try to get completions, we acquire the CQ's tail, resulting in correct synchronization.
Edit:
FWIW, liburing's sq_cq_overflow equivalent also uses a relaxed load.
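If it helps, here is one possible wording for the comments (just a suggestion, not authoritative; the bodies are the ones quoted above, with sq_cq_overflow's load shown as Relaxed per the PR discussed right below):

```rust
/// Whether the kernel thread has gone to sleep because it waited for too long
/// without submission queue entries.
///
/// A Relaxed load is enough: the kernel does not set NEED_WAKEUP with a
/// release store and we acquire no data through this flag. The SeqCst fence
/// issued in submit() before this call is what guarantees that either the
/// kernel observes our tail update or we observe NEED_WAKEUP.
#[inline]
fn sq_need_wakeup(&self) -> bool {
    unsafe {
        (*self.sq_flags).load(atomic::Ordering::Relaxed) & sys::IORING_SQ_NEED_WAKEUP != 0
    }
}

/// CQ ring is overflown.
///
/// A Relaxed load is enough here too: the flag only tells us to enter the
/// kernel with GETEVENTS. The kernel then flushes the overflowed CQEs and
/// release-stores the CQ tail, and our later Acquire load of the CQ tail is
/// what synchronizes reading those completions.
fn sq_cq_overflow(&self) -> bool {
    unsafe {
        (*self.sq_flags).load(atomic::Ordering::Relaxed) & sys::IORING_SQ_CQ_OVERFLOW != 0
    }
}
```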
Thank you for these details. To me, at least for now, this seems much clearer.
@quininer Would you be amenable to a PR for the sq_cq_overflow function being relaxed?
@FrankReh of course.