RIOT icon indicating copy to clipboard operation
RIOT copied to clipboard

core/cond.c: expect enabled interrupts in cond_wait()

Open derMihai opened this issue 1 year ago • 9 comments

Contribution description

Bug 1: cond_wait() should never be called with interrupts disabled

The way cond_wait() is currently implemented suggests that it may be called with interrupts disabled. Going to sleep with interrupts disabled doesn't quite make sense (although see below) and doesn't work either, as e.g. on ARM thread_yield_higher() triggers an interrupt. For example, the following assertion is triggered on both ARM and native:

    static cond_t cond = COND_INIT;
    static mutex_t lock = MUTEX_INIT;

    mutex_lock(&lock);
    irq_disable();
    cond_wait(&cond, &lock); // this should block forever
    assert(0); // thread_yield_higher() couldn't trigger the interrupt
    mutex_unlock(&lock);

My suggestion is to always assume interrupts are enabled. However, there is a valid, albeit hacky use-case for calling cond_wait() with interrupts disabled, and that is to guarantee [ condition check + going to sleep ] is atomic w.r. to [ condition set + wake up ] when the latter is in interrupt context:

static cond_t cond = COND_INIT;
static mutex_t lock = MUTEX_INIT; // dummy mutex only for the cond API
static bool trigger = false;

void waiter(void) 
{
    mutex_lock(&lock); 
    // disable IRQ s.t. ISR doesn't fire between reading `trigger` and `cond_wait()`
    // and we end up blocking forever
    irq_disable(); 
    while (!trigger) {
       cond_wait(&cond, &lock); 
    }
    irq_enable();
    mutex_unlock(&lock);
}

void waker_isr(void) 
{
   trigger = true;
   cond_signal(&cond);
}

Then, we can fix cond_wait() by enabling interrupts just before calling thread_yield_higher(), and restoring the interrupts state afterwards.

    ...
    sched_set_status(me, STATUS_COND_BLOCKED);
    thread_add_to_list(&cond->queue, me);
    irq_enable();
    mutex_unlock(mutex);
    thread_yield_higher();
    irq_restore(irqstate);
    ...

I'm not sure which fix is the best. The first one (in the diffs) is more pure, the second enables more use-cases.

Bug 2: cond_wait() is calling mutex_unlock() with interrupts disabled

No matter whether cond_wait() is called with interrupts disabled or not, it calls mutex_unlock() with disabled interrupts. mutex_unock() will reschedule if a higher priority thread is already blocking on the mutex, leading to the same bug as above. I therefore moved mutex_unlock() after interrupts are enabled. This still guarantees that [ condition check + going to sleep ] is atomic w.r. to [ condition set + wake up ].

derMihai avatar Oct 24 '24 08:10 derMihai

This is an issue also in other parts of core, e.g., in core/mutex and msg.

kaspar030 avatar Oct 24 '24 08:10 kaspar030

My suggestion is to always assume interrupts are enabled.

I think this makes the most sense.

kaspar030 avatar Oct 24 '24 08:10 kaspar030

This is an issue also in other parts of core, e.g., in core/mutex and msg.

Indeed, and it seems the fix is the same everywhere. Probably we should also add an arch-independent wrapper around thread_yield*() to assert interrupts are enabled.

derMihai avatar Oct 24 '24 08:10 derMihai

This is an issue also in other parts of core, e.g., in core/mutex and msg.

Indeed, and it seems the fix is the same everywhere. Probably we should also add an arch-independent wrapper around thread_yield*() to assert interrupts are enabled.

The question would be if we should define those functions as interrupt-enabling, and drop the state handling inside them (by doing irq_disable()/irq_enable() directly).

Any alternative allows mis-use (e.g., by doing irq_disable(); mutex_lock(...)), which can only be runtime detected, but it is unclear how to handle that.

kaspar030 avatar Oct 24 '24 08:10 kaspar030

The question would be if we should define those functions as interrupt-enabling, and drop the state handling inside them (by doing irq_disable()/irq_enable() directly).

I don't see a problem with that. Any prior usage that conflicts with this was a bug anyway.

derMihai avatar Oct 24 '24 09:10 derMihai

Probably we should also add an arch-independent wrapper around thread_yield*() to assert interrupts are enabled.

That would also make sense. Currently, core code does irq_restore(...); thread_yield_higher(). depending on the platform, the other way around is allowed and might save a scheduler run.

kaspar030 avatar Oct 24 '24 09:10 kaspar030

This goes deeper. Also stuff like mutex_unlock() may not be called with interrupts disabled.

So to revise the definition, it's not only stuff that might block that may not be called with interrupts disabled, but also stuff that might reschedule. And I think this is sensible. The Linux kernel assumes the same for held spinlocks (which on UP do nothing more than disabling interrupts).

derMihai avatar Oct 24 '24 10:10 derMihai

Murdock results

:heavy_check_mark: PASSED

a6b2b03c8d5b90004cd53819029e7cab4f33d307 core/cond.c: cond_wait() expect enabled interrupts

Success Failures Total Runtime
10213 0 10215 17m:31s

Artifacts

riot-ci avatar Oct 24 '24 20:10 riot-ci

I would limit the scope of this PR to the original intent, and that is to fix cond_wait(), as meanwhile I discovered a second, more important bug. I opened https://github.com/RIOT-OS/RIOT/pull/20941 to deal with the more general and complex issue of locking with interrupts disabled.

derMihai avatar Oct 25 '24 11:10 derMihai