heapless icon indicating copy to clipboard operation
heapless copied to clipboard

Possible data race in the iteration of Stack's pop and UnionNode

Open thalesfragoso opened this issue 7 months ago • 4 comments

I will use the code from the LL-SC implementation, but the same applies to the CAS one.

https://github.com/rust-embedded/heapless/blob/fb62d12ad502db181b5fedbba22aef2f40b2a3e1/src/pool/treiber/llsc.rs#L84-L116

Between the head load in line 89 and the read to get the next pointer in line 95, another thread could have started and successfully finished a try_pop. Back to the first thread, now the top points to an in-use node that can be written to, since the storage for the next pointer and the data are the same (i.e. it's an union) we now get a data race.

I will try to trigger it with loom and miri and then try to come up with a fix. The simplest solution seems to replace the UnionNode with StructNode. However, I hope I can find a solution that doesn't incur in more memory usage, specially since I was the one that came up with the union suggestion years ago...

thalesfragoso avatar Apr 13 '25 18:04 thalesfragoso

AFAIUI top_addr will stay the same, even when some other task adds or removes an element. top_addr holds the address of a pointer that points to the first Node, i.e. updates to *top_addr must be atomic, which is achieved by load_linkand store_conditional. Disclaimer: I am not an expert on atomic operations and might be totally wrong

jr-oss avatar Apr 14 '25 16:04 jr-oss

load_link sets a flag and store_conditional clears it on the successful path. So, in my example, the second thread would have a successful store_conditional while the first one would fail. But the problem is that we read from it before the store_conditional, in line 95.

That next on UnionNode will read from the same storage as data, which the second thread made available to the application to write, thus causing a data race.

thalesfragoso avatar Apr 14 '25 16:04 thalesfragoso

Took me a while to understand the problem.

llsc.rs

92            let top = arch::load_link(top_addr);
              // `try_pop()` may be called in other thread.
94            if let Some(top) = NonNull::new(top as *mut N) {
                  // `try_pop()` may be called in other thread.
95                let next = &top.as_ref().next();

cas.rs

216        if let Some(mut top) = stack.top.load(Ordering::Acquire) {
               // `try_pop()` may be called in other thread.
217            let next = unsafe { top.non_null().as_ref().next().load(Ordering::Relaxed) };

So calling as_ref() is still sound, given that the top node lives as long as the stack itself.

Calling next() is unsound for UnionNode since another thread may already have written the data variant.

reitermarkus avatar Apr 26 '25 17:04 reitermarkus

I think even as_ref for UnionNode is unsound under stack borrows.

I was working into getting some loom + miri support for heapless. My idea is to have a PR with a failing loom test and then provide a fix, that way we can hopefully avoid such unsoundness in the future too.

thalesfragoso avatar Apr 27 '25 13:04 thalesfragoso