embassy Fix panic safety issue in TaskStorage

Eyeballed this bug while working on #1896.

If the future constructor function passed to AvailableTask::initialize panics, TaskStorage is left invalid, where its state is STATE_SPAWNED but the future cell remains uninitialized.

This pull request adds a new state, STATE_CLAIMED, which is used to reserve an AvailableTask in claim before being initialized properly in initialize_impl. Once the future slot is initialized, state is set to STATE_SPAWNED | STATE_RUN_QUEUED with release ordering. I'm not 100% confident about memory ordering stuff, but I think release is the correct barrier to use here, since we need to prevent any writes from being reordered after this point.

Sep 14 '23 04:09 haileys

If the future constructor function passed to AvailableTask::initialize panics

This PR still leaves the TaskStorage in an unusable state. Maybe a drop guard in initialize_impl could be used to reset the state back to ELIGIBLE if the constructor panics, just so that embassy doesn't leak the storage. What do you think?

Sep 14 '23 05:09 bugadani

@bugadani That's a really good idea! I've gone with implementing Drop on the AvailableTask struct itself, which sets the task state according to a next_state field.

This also covers us in the case that the AvailableTask is claimed but then dropped before anything else is done with it, which could happen due to it being a public API.

Sep 14 '23 05:09 haileys

Thanks for the PR! I'm not sure if we want to do this though:

This is only a problem with panic=unwind, and binaries for embedded systems are usually built with panic=abort.
The consequence is only leaking the task storage, there's no actual undefined behavior (the panic happens before enqueuing, so the executor will never touch the half-initialized task. Also, since the RUN_QUEUED bit is set, a wake won't enqueue it)
The fix is not zero-cost

An alternative fix would be to wrap the closure call with a catch_unwind, and set state back to 0 on panic. This would be zero-cost since catch_unwind compiles to nothing when panic=abort.

Sep 15 '23 17:09 Dirbaio

Thanks for the feedback @Dirbaio!

I've pushed up a change that addresses the zero cost concern. We can instead make the task field an Option and use the presence of the task reference to prevent rollback to STATE_ELIGIBLE in Drop. The None case is guaranteed to optimise to a null pointer, and the optimiser should be able to inline and elide the condition in Drop in most cases too.

Sep 16 '23 06:09 haileys

it's still slightly not zero-cost. you're doing 2 atomic writes (available -> claimed -> spawned+run_queued), vs 1 as before.

Is there a reason the catch_unwind solution wouldn't work?

Sep 16 '23 09:09 Dirbaio

Is there a reason the catch_unwind solution wouldn't work?

Do you really want to stop unwinding here? I think the current Drop-solution is fine, but with a slight modification:

set the final happy-path state initially
set back to 0 in Drop
forget self if we have the spawn token

I'm relying on the compiler figuring out that drop will not be reachable if panic=abort which may or may not be the thing.

Sep 16 '23 13:09 bugadani

ah, it's true catch_unwind will stop the panic, so it's observable from the outside. The solution proposed by @bugadani sounds good to me, I think it's likely it'll optimize out with panic=abort.

Sep 16 '23 23:09 Dirbaio

I've decided to close this.

This is not a soundness issue, panicking in the future constructor will leave the task with state "spawned, run queued", but not actually queued. This means it gets just leaked, it can't cause UB because this state ensures nobody else will touch it and witness the non-constructed future.

This issue can only happen if you use a target with panic=unwind (rare) and you call TaskStorage::spawn manually (rare, most users use the #[task] macro) and that closure panics.

Fixing it is a bit complex, it'd require undo_spawn to State, to the 3 implementations we have.

Apr 06 '25 23:04 Dirbaio