heapless
heapless copied to clipboard
Possible data race in the iteration of Stack's pop and UnionNode
I will use the code from the LL-SC implementation, but the same applies to the CAS one.
https://github.com/rust-embedded/heapless/blob/fb62d12ad502db181b5fedbba22aef2f40b2a3e1/src/pool/treiber/llsc.rs#L84-L116
Between the head load in line 89 and the read to get the next pointer in line 95, another thread could have started and successfully finished a try_pop. Back to the first thread, now the top points to an in-use node that can be written to, since the storage for the next pointer and the data are the same (i.e. it's an union) we now get a data race.
I will try to trigger it with loom and miri and then try to come up with a fix. The simplest solution seems to replace the UnionNode with StructNode. However, I hope I can find a solution that doesn't incur in more memory usage, specially since I was the one that came up with the union suggestion years ago...
AFAIUI top_addr will stay the same, even when some other task adds or removes an element.
top_addr holds the address of a pointer that points to the first Node, i.e. updates to *top_addr must be atomic, which is achieved by load_linkand store_conditional.
Disclaimer: I am not an expert on atomic operations and might be totally wrong
load_link sets a flag and store_conditional clears it on the successful path. So, in my example, the second thread would have a successful store_conditional while the first one would fail. But the problem is that we read from it before the store_conditional, in line 95.
That next on UnionNode will read from the same storage as data, which the second thread made available to the application to write, thus causing a data race.
Took me a while to understand the problem.
llsc.rs
92 let top = arch::load_link(top_addr);
// `try_pop()` may be called in other thread.
94 if let Some(top) = NonNull::new(top as *mut N) {
// `try_pop()` may be called in other thread.
95 let next = &top.as_ref().next();
cas.rs
216 if let Some(mut top) = stack.top.load(Ordering::Acquire) {
// `try_pop()` may be called in other thread.
217 let next = unsafe { top.non_null().as_ref().next().load(Ordering::Relaxed) };
So calling as_ref() is still sound, given that the top node lives as long as the stack itself.
Calling next() is unsound for UnionNode since another thread may already have written the data variant.
I think even as_ref for UnionNode is unsound under stack borrows.
I was working into getting some loom + miri support for heapless. My idea is to have a PR with a failing loom test and then provide a fix, that way we can hopefully avoid such unsoundness in the future too.