heapless
The LL/SC stack is not safe from ABA with how LL/SC is used
On the Cortex-M, STREX will always fail if the processor takes an exception between it and its corresponding LDREX operation. But the load of next at 8000136 is done outside of the LDREX...STREX sequence.
So if the process is preempted between 8000136 and 800013a, and during the preemption we trigger the ABA pattern (e.g. given h->A->B->C->0: pop A, pop B, push A, resulting in h->A->C->0 with B being in use), then the compare-and-swap can still succeed after resuming and write the stale next pointer (B) into the head even though B is in use.
Now, a given Cortex-M implementation might not have that problem:
A context switch might cause a subsequent Store-Exclusive to fail, requiring a load ... store sequence to be replayed. To minimize the possibility of this happening, ARM recommends that the Store-Exclusive instruction is kept as close as possible to the associated Load-Exclusive instruction, see Load-Exclusive and Store-Exclusive usage restrictions.
(quote from the ARMv7-M Architecture Reference Manual, linked in the source code)
The problem is that this is a "might", so depending on where you get your Cortex-M chip from, this might or might not be a sound implementation.
https://github.com/japaric/heapless/blob/9ff3a5fa8968fb263cece456ccc9505dc913147e/src/pool/mod.rs#L123-L139
As a side note, I don't know which of the currently sold Cortex-M based processors have that problem, nor do I know if there are any.
So this is a specification-related bug, but it might not be a bug in practice. The problem is that it's hard to tell whether it is.
No, it's absolutely a problem in practice. The load of next is outside of the LL/SC region, so the "might" in the spec doesn't matter: a preemption before the LL/SC region begins (or concurrent accesses on actual SMP hardware) is not going to trigger an abort. Taking advantage of LL/SC like this is just not possible with C11-based intrinsics; you need the LL(head) part of compare_exchange_weak to happen before the load of next. This is doable in assembly, but not with the typical Rust/C/etc. intrinsics. The assembly shown in the docs clearly shows how this is incorrect; for it to be OK, it would need the ldrex/ldr.w instructions reordered (and then different register allocation):
ldrex r3, [r0]
ldr.w ip, [r2]
cmp r3, r1
bne.n 800014a <<heapless::pool::Pool<T>>::pop+0x1a>
strex r3, ip, [r0]
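For reference, here is a minimal Rust sketch (not the actual heapless code, just the pattern the intrinsics give you) of a Treiber pop built on compare_exchange_weak; the read of next necessarily happens before the LDREX that the intrinsic lowers to:

```rust
// Sketch of a Treiber-stack `pop` written with C11-style intrinsics.
// Names are illustrative; the point is where the load of `next` lands.
use core::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    next: *mut Node,
}

struct Stack {
    head: AtomicPtr<Node>,
}

impl Stack {
    fn pop(&self) -> Option<*mut Node> {
        let mut head = self.head.load(Ordering::Acquire);
        loop {
            if head.is_null() {
                return None;
            }
            // This read happens before the LDREX emitted by
            // `compare_exchange_weak`, so an ABA in between is not detected.
            let next = unsafe { (*head).next };
            match self.head.compare_exchange_weak(
                head,
                next,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(head),
                Err(actual) => head = actual,
            }
        }
    }
}
```

The compare_exchange API gives no way to ask for that load of next to be placed between the LDREX and the STREX.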
Meanwhile, the warned-about x86-64 generation-counter algorithm is safe in basically any realistic scenario.
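For comparison, a rough sketch of that generation-counter idea (layout and names are illustrative, not the actual heapless x86-64 code): the head packs a node index and a tag into one word, and the tag is bumped on every successful pop, so an A -> B -> A sequence no longer compares equal to the stale expected value:

```rust
// Sketch of a tagged-head Treiber stack for targets with a wide-enough CAS
// (e.g. x86-64). Indices and a fixed array of links stand in for node pointers.
use core::sync::atomic::{AtomicU32, AtomicU64, Ordering};

const NIL: u32 = u32::MAX;

fn pack(index: u32, tag: u32) -> u64 {
    (u64::from(tag) << 32) | u64::from(index)
}

fn unpack(word: u64) -> (u32, u32) {
    (word as u32, (word >> 32) as u32)
}

struct TaggedStack {
    head: AtomicU64,        // low half: node index (or NIL), high half: tag
    next: [AtomicU32; 32],  // per-node link: index of the next node, or NIL
}

impl TaggedStack {
    fn pop(&self) -> Option<u32> {
        let mut cur = self.head.load(Ordering::Acquire);
        loop {
            let (index, tag) = unpack(cur);
            if index == NIL {
                return None;
            }
            let next = self.next[index as usize].load(Ordering::Acquire);
            // Even if `index` is popped and pushed back concurrently, its tag
            // will have moved on, so this CAS fails and the loop retries.
            match self.head.compare_exchange_weak(
                cur,
                pack(next, tag.wrapping_add(1)),
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(index),
                Err(actual) => cur = actual,
            }
        }
    }
}
```

The usual caveat applies: this is only safe as long as the tag cannot wrap around within a single preemption window.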
@japaric Do you know how to fix this? Or should we just trash the use of atomics and do the critical move in an interrupt-free section? As I see it, interrupt-free is the way to go to get deterministic execution time, and the section will be extremely small.
yes, I have written this stack using ldrex and strex "intrinsics" (plain functions around inline assembly) before. I think that was more correct (and shorter) than what one gets with compare_exchange. That requires a nightly toolchain because of the inline assembly, and I no longer have that code around.
Or should we just trash the use of atomics and do the critical move in an interrupt-free section?
not a fan of the interrupt::free suggestion. If we are going to make the code device-specific, I would look into writing the push and pop subroutines in "external assembly", which works on stable (no time for any of that this week though).
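For what it's worth, the "external assembly" route would look roughly like this (every file and symbol name here is hypothetical): a build script assembles a hand-written .s file, and the Rust side only needs stable extern "C" declarations:

```rust
// build.rs (hypothetical): assemble the hand-written routines on stable Rust.
// fn main() {
//     cc::Build::new().file("asm/pool.s").compile("pool_asm");
// }

// Rust side (hypothetical symbol names): no inline assembly, no nightly needed.
extern "C" {
    /// Pushes `node` onto the intrusive stack whose head word is at `head`.
    fn _pool_push(head: *mut usize, node: *mut usize);
    /// Pops the top node, or returns a null pointer if the stack is empty.
    fn _pool_pop(head: *mut usize) -> *mut usize;
}
```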
Hmm, what are you not a fan of? It will be a few instructions long (4?) with guaranteed execution time.
So it will be more efficient than the current ldrex and strex code, as the current loops and checks go away because the operation is guaranteed to succeed.
For keeping testability on x86 we can have an atomics-based spin-wait (same as it is now).
I'm not seeing the problem you are seeing.
the documentation states this is a lock-free queue. adding a critical section breaks that guarantee and would be a breaking change in my opinion.
spsc::Queue – single producer single consumer lock-free queue
I think we disagree then; as long as you do not need mutable references it is lock-free, as the pool works via immutable references. And this API will be kept, so in my view it's not a breaking change.
I'd also argue that using atomics here is over-engineered: the critical section would be extremely small and would, in the worst case, add a few cycles of predictable latency to the system. And since the fix seems to entail custom assembly, I'd argue even harder that atomics are not the right tool here, given that we want future maintainability and easy understanding. So the choice is: keep atomics and accept an unbounded worst case and an advanced implementation, or use critical sections and get a strictly bounded worst case (which is smaller than the best case with atomics) and very easy to understand code.
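To make the proposal concrete, here is a minimal sketch of the interrupt-free pop (illustrative types, not the heapless API); the whole critical section is one load and one store:

```rust
// Sketch of the critical-section variant being argued for. The whole pop is
// a handful of instructions executed with interrupts disabled, so the added
// worst-case latency is small and fixed.
use core::cell::Cell;
use core::ptr::NonNull;

struct Node {
    next: Option<NonNull<Node>>,
}

struct Stack {
    head: Cell<Option<NonNull<Node>>>,
}

// Sound only on single-core targets where every access goes through
// `interrupt::free`; this is the device-specific part of the design.
unsafe impl Sync for Stack {}

impl Stack {
    fn pop(&self) -> Option<NonNull<Node>> {
        cortex_m::interrupt::free(|_cs| {
            let head = self.head.get()?;
            // No preemption is possible here, so reading `next` and storing
            // it back cannot race with another push/pop.
            self.head.set(unsafe { head.as_ref().next });
            Some(head)
        })
    }
}
```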
I think we disagree then; as long as you do not need mutable references it is lock-free, as the pool works via immutable references. And this API will be kept, so in my view it's not a breaking change.
Mutation via shared references (&-) can be called 'interior mutability' and/or 'thread-safe' (= implements Sync). That's a different property from 'lock-free', which means free of 'critical sections', whether these are achieved via disabling interrupts or via any flavor of Mutex.
To be clear, I'm not opposed to adding a general-purpose fixed-capacity queue that works with mutable references, but that's a different data structure. People can layer device-specific synchronization on top of that queue to get what you are describing, e.g.
// crate: heapless
pub struct GeneralPurposeQueue<T> { /* .. */ }

impl<T> GeneralPurposeQueue<T> {
    pub fn enqueue(&mut self, item: T) -> Result<(), T> { /* .. */ }
    pub fn dequeue(&mut self) -> Option<T> { /* .. */ }
}

// crate: cortex-m-queue
pub struct CortexMQueue<T>(interrupt::Mutex<RefCell<GeneralPurposeQueue<T>>>);

impl<T> CortexMQueue<T> {
    pub fn enqueue(&self, item: T) -> Result<(), T> {
        interrupt::free(|cs| self.0.borrow(cs).borrow_mut().enqueue(item))
    }
}
I'd also argue that using atomics here is over-engineered
Bounded (fixed-capacity) lock-free SPSC queues based on atomics are a well-known data structure that dates back to around the 1980s (I think the original work is by Lamport). There are plenty of variants (bounded, unbounded, cache-friendly, etc.) by now, but I would not call any of them "over-engineered"; they all have different trade-offs.
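For reference, a rough sketch of the classic Lamport-style bounded SPSC ring buffer (illustrative, not the heapless implementation): the producer owns tail, the consumer owns head, and no critical section is needed because each index has exactly one writer:

```rust
// Sketch of a Lamport-style bounded SPSC queue. One slot is left empty to
// distinguish "full" from "empty".
use core::cell::UnsafeCell;
use core::mem::MaybeUninit;
use core::sync::atomic::{AtomicUsize, Ordering};

const N: usize = 16; // capacity + 1

struct SpscQueue<T> {
    buffer: [UnsafeCell<MaybeUninit<T>>; N],
    head: AtomicUsize, // written only by the consumer
    tail: AtomicUsize, // written only by the producer
}

// Safe to share across the producer/consumer pair: each side only writes
// its own index, and the Acquire/Release pairs order the slot accesses.
unsafe impl<T: Send> Sync for SpscQueue<T> {}

impl<T> SpscQueue<T> {
    /// Producer side: returns the item back when the queue is full.
    fn enqueue(&self, item: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        let next = (tail + 1) % N;
        if next == self.head.load(Ordering::Acquire) {
            return Err(item); // full
        }
        unsafe { (*self.buffer[tail].get()).write(item) };
        self.tail.store(next, Ordering::Release);
        Ok(())
    }

    /// Consumer side: returns None when the queue is empty.
    fn dequeue(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        if head == self.tail.load(Ordering::Acquire) {
            return None; // empty
        }
        let item = unsafe { self.buffer[head].get().read().assume_init() };
        self.head.store((head + 1) % N, Ordering::Release);
        Some(item)
    }
}
```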
The other problem I see with a critical-section based queue that has no atomics is that it's not multi-core safe, whereas the lock-free queue here is.
I see that your main line of argument is "bounded time execution". I can see the use case for that and I'm fully OK with a queue that has such a property, but like I said above, that's a different data structure, so we should not be modifying the existing spsc::Queue but rather adding a new queue with that property.
I think we are discussing two different things: spsc::Queue is fine as is; pool and its internal Treiber stack is the one with the issue. :)
now that asm! is stable, the LL/SC version can be rewritten to use asm!. The optimized generated assembly (target=Cortex-M4F) for the Treiber stack is:
; fn push(r0: &Self, r1: NonNull<Node>)
00000000 <treiber_stack::Stack::push>:
0: /-> e850 2f00 ldrex r2, [r0]
4: | 600a str r2, [r1, #0]
6: | e840 1200 strex r2, r1, [r0]
a: | 2a00 cmp r2, #0
c: \-- d1f8 bne.n 0 <treiber_stack::Stack::push>
e: 4770 bx lr
; fn pop(r0: &Self) -> Option<NonNull<Node>>
00000000 <treiber_stack::Stack::pop>:
0: 4601 mov r1, r0
2: /----> e851 0f00 ldrex r0, [r1]
6: | /-- b130 cbz r0, 16 <treiber_stack::Stack::pop+0x16>
8: | | 6802 ldr r2, [r0, #0]
a: | | e841 2300 strex r3, r2, [r1]
e: | | 2b00 cmp r3, #0
10: | | bf08 it eq
12: | | 4770 bxeq lr
14: \--|-- e7f5 b.n 2 <treiber_stack::Stack::pop+0x2>
16: \-> 2000 movs r0, #0
18: f3bf 8f2f clrex
1c: 4770 bx lr
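For the record, a sketch of what the asm!-based pop could look like (the offset-0 layout of head and next, and the label numbering, are assumptions on my part; the actual PR may differ). The important part is that the ldr of next sits between the ldrex and the strex, as in the disassembly above:

```rust
// Sketch of an asm!-based `pop`; layout and names are illustrative.
use core::arch::asm;
use core::cell::UnsafeCell;
use core::ptr::NonNull;

#[repr(C)]
pub struct Node {
    next: *mut Node, // must stay at offset 0 for the `ldr` below
}

#[repr(C)]
pub struct Stack {
    head: UnsafeCell<*mut Node>, // must stay at offset 0 for `ldrex`/`strex`
}

impl Stack {
    pub fn pop(&self) -> Option<NonNull<Node>> {
        let popped: *mut Node;
        unsafe {
            asm!(
                "2:",
                "ldrex {node}, [{head}]",        // LL: node = *head
                "cmp {node}, #0",
                "beq 3f",                        // empty stack
                "ldr {next}, [{node}]",          // node.next, *inside* the LL/SC pair
                "strex {tmp}, {next}, [{head}]", // SC: *head = next
                "cmp {tmp}, #0",
                "bne 2b",                        // SC failed: retry
                "b 4f",
                "3:",
                "clrex",                         // drop the exclusive monitor
                "4:",
                head = in(reg) self.head.get(),
                node = out(reg) popped,
                next = out(reg) _,
                tmp = out(reg) _,
            );
        }
        NonNull::new(popped)
    }
}
```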
I'll prepare a PR