otg_fs / usbfs potential lockup
Writing this more or less as a PSA: the otg_fs / usbfs driver seems to be able to enter a lockup state where an interrupt fires and calls the waker, but the task is never actually woken up.
Currently debugging this issue; feel free to help if interested.
just to check, are you sure the waker is not being woken, and not that it is being woken but returns pending again even though the event you were waiting for should have triggered?
i had a quick look at the usbfs code for the first time just now and one thing that sticks out to me is that you're checking interrupt flags etc in the poll_fns. the reason i did this in the interrupt handler in my stab at usbhs is that the hardware is very prone to letting you miss interrupts if you aren't careful: when you receive a transfer complete interrupt, you will "lose" the information stored in the registers if another transfer occurs, including rx_len, which happens fairly often. so the poll_fn never sees the interrupt it's looking for (the hardware sets the endpoint to nak after the ack is sent… i thought? it doesn't say this in the RM now that i'm reading it. well, if it did, you won't even get the next transfer's interrupt, which would have a screwed-up data toggle)
in fact, i suppose it is possible that the interrupt handler is “preempted” by the hardware: say you read the int_st, then another transfer completes, then you read rx_len: that rx_len would be for the wrong transfer. race conditions should be checked for by clearing the interrupt flag early, then making sure it is still cleared at the end, else send an error for the endpoint (we can’t be sure what we read was correct as we might have raced) and loop back around to handle the new transfer (or we could just let the interrupt handler fire again… but i don’t really trust the interrupt latency on risc v for that)
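roughly the shape i have in mind, as a rust sketch. the register/helper names here are made up, not the real ch32-hal or PAC API, i'm only illustrating the clear-early-then-recheck pattern:

```rust
/// Hypothetical interrupt-handler fragment showing the
/// "clear the flag early, re-check it at the end" race check.
/// `clear_int_flag`, `read_int_flag`, `read_rx_len` and `report`
/// are placeholders, not real ch32-hal / PAC functions.
fn on_transfer_irq(ep: usize) {
    // acknowledge the transfer interrupt *before* reading the
    // per-transfer registers, so a new transfer will set it again
    clear_int_flag(ep);

    // snapshot everything that belongs to this transfer
    let rx_len = read_rx_len(ep);

    // if the flag is set again, another transfer completed while we were
    // reading: the snapshot may mix values from two different transfers
    if read_int_flag(ep) {
        report(Event::Error { ep });
    } else {
        report(Event::TransferComplete { ep, rx_len });
    }
}

enum Event {
    TransferComplete { ep: usize, rx_len: u16 },
    Error { ep: usize },
}

// placeholder register accessors, just so the sketch is self-contained
fn clear_int_flag(_ep: usize) {}
fn read_int_flag(_ep: usize) -> bool { false }
fn read_rx_len(_ep: usize) -> u16 { 0 }
fn report(_ev: Event) {}
```

(the loop-back-around variant would just retry instead of reporting the error)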
two other things: https://github.com/ch32-rs/ch32-hal/blob/e2bee219ec19932488e433d22c8c35ae041b7664/src/otg_fs/endpoint.rs#L168 this assertion is wrong. the host is 100% allowed to send a shorter packet than we were expecting: that's what this check is for https://github.com/embassy-rs/embassy/blob/6e0b08291b63a0da8eba9284869d1d046bc5dabb/embassy-usb/src/lib.rs#L418 and also: https://github.com/ch32-rs/ch32-hal/blob/e2bee219ec19932488e433d22c8c35ae041b7664/src/otg_fs/endpoint.rs#L176 checks like this one are not needed, since it should totally be possible to have an IN and an OUT endpoint on the same number (except for control endpoints)
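for the first point, what i'd expect is for the read path to hand back however many bytes actually arrived and let embassy-usb decide whether a short read is a problem. very rough sketch (the error enum just mirrors the shape of embassy-usb-driver's EndpointError, and rx_len / copy_from_hw_buf are placeholders, not your actual code):

```rust
/// Rough sketch of an OUT-endpoint read that tolerates short packets.
pub enum EndpointError {
    BufferOverflow,
    Disabled,
}

pub fn finish_read(buf: &mut [u8]) -> Result<usize, EndpointError> {
    // however many bytes the host actually sent in this packet
    let received = rx_len();

    // a packet *longer* than the buffer really is an error ...
    if received > buf.len() {
        return Err(EndpointError::BufferOverflow);
    }

    // ... but a *shorter* packet is perfectly legal: return its length and
    // let the layer above (embassy-usb) decide if it expected more data
    copy_from_hw_buf(&mut buf[..received]);
    Ok(received)
}

// placeholders so the sketch is self-contained
fn rx_len() -> usize { 0 }
fn copy_from_hw_buf(_dst: &mut [u8]) {}
```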
First of all, thank you so much for reviewing our code! Really could use an extra pair of eyes.
just to check, are you sure the waker is not being woken, and not that it is being woken but returns pending again even though the event you were waiting for should have triggered
Well, I believe so, but will double check again later today. We added some prints; the prints acquire a critical section so they should be "globally ordered". (Debugging the IN endpoint write, where we have a more or less consistent repro.)
Code with more prints: https://github.com/vapor-keeb/ch32-hal/tree/0b97d20327f53677d55baf3dc784002c2d7a730a/src/otg_fs Resulting log is: https://gist.github.com/Codetector1374/ddccb83b2816a4a1ba06675c170a0abb
```
[TRACE]endpoint.rs:86 : endpoint 1 IN
[TRACE]endpoint.rs:129 : suprious wakeup
[TRACE] mod.rs:60 : irq: 6a
[TRACE] mod.rs:86 : Walking up: 1
[TRACE] main.rs:74 : <0,1> changed to High
```
Above is where the "bug" shows up. By seeing both endpoint 1 IN and suprious wakeup (I know it is not actually spurious; a poll_fn is expected to run at least once and can re-run because other futures in the same join completed, etc.) we know we have for sure registered ourselves with the waker. Then we can see irq 6a, which would be a transfer IRQ, and it did in fact wake up endpoint 1. However, we never see a suprious wakeup again (or the request actually completing).
This is what led me to believe there is a lost wakeup.
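For context, the wait loop follows the usual "register the waker first, then check the flag" pattern, so a wake that lands after registration should always be observed on the next poll. A simplified sketch of that pattern using embassy-sync's AtomicWaker (not the exact driver code; the statics and names here are illustrative):

```rust
use core::future::poll_fn;
use core::sync::atomic::{AtomicBool, Ordering};
use core::task::Poll;
use embassy_sync::waitqueue::AtomicWaker;

static EP1_IN_WAKER: AtomicWaker = AtomicWaker::new();
static EP1_IN_DONE: AtomicBool = AtomicBool::new(false);

/// Called from the USB interrupt handler when the IN transfer completes.
fn on_ep1_in_complete() {
    EP1_IN_DONE.store(true, Ordering::SeqCst);
    EP1_IN_WAKER.wake();
}

/// Wait for the IN transfer to complete.
async fn wait_ep1_in_done() {
    poll_fn(|cx| {
        // register *before* checking the flag, so a completion that races
        // with this poll still wakes us for the next poll
        EP1_IN_WAKER.register(cx.waker());
        if EP1_IN_DONE.swap(false, Ordering::SeqCst) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    })
    .await;
}
```

With that ordering, the only way to see "waker registered, wake() called, task never polled again" is for the wakeup itself to get lost somewhere downstream.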
checking interrupt flags etc in the poll_fns
Our understanding has been that with int_busy (below) the hw will automatically NAK all further host requests without disturbing any register until int_fg.transfer is cleared, which we clear in our poll_fns.
https://github.com/ch32-rs/ch32-hal/blob/d7f1b264dfd6f66e629f61be40253e09dae1b16c/src/otg_fs/mod.rs#L226
So I have actually never seen a case (yet) where the registers are clobbered / lost for otg_fs as long as int_busy is set (which the WCH example code also does).
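In other words, the ordering we rely on looks roughly like this (the register accessors below are pseudo-code placeholders, not the real PAC API):

```rust
/// Pseudo-code for the ordering we rely on with `int_busy` set.
/// `usb_int_flag_transfer`, `clear_usb_int_flag_transfer`, `read_rx_len`
/// and `copy_from_ep_buffer` are placeholders, not the real ch32 PAC accessors.
fn poll_out_transfer(ep: usize, buf: &mut [u8]) -> Option<usize> {
    // while int_fg.transfer is still set, the core keeps NAKing the host,
    // so rx_len and the endpoint state stay frozen for us to read
    if !usb_int_flag_transfer() {
        return None;
    }

    let len = read_rx_len(ep).min(buf.len());
    copy_from_ep_buffer(ep, &mut buf[..len]);

    // only now release the hardware: clearing the flag lets the next
    // transaction proceed (and potentially overwrite rx_len)
    clear_usb_int_flag_transfer();
    Some(len)
}

// placeholders so the sketch is self-contained
fn usb_int_flag_transfer() -> bool { false }
fn clear_usb_int_flag_transfer() {}
fn read_rx_len(_ep: usize) -> usize { 0 }
fn copy_from_ep_buffer(_ep: usize, _dst: &mut [u8]) {}
```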
this assertion is wrong.
That is a good catch. I think we didn't actually think that through as much when we wrote that assertion.
checks like this one are not needed since it should totally be possible to have an IN and OUT endpoint on the same number (except for control endpoints)
I see, I did in fact have a misunderstanding of the USB spec. I guess this is "technically not broken yet", but good point, we probably should fix that in the future.
Our understanding has been that with int_busy (below) the hw will automatically NAK all further host requests without disturbing any register until int_fg.transfer is cleared, which we clear in our poll_fns.
Ah, I'd glossed over int_busy. That makes sense (as long as there isn't a bug with it or something)
As for your problem: I'm not really sure. maybe the waker is not registered somehow?! unfortunately AtomicWaker doesn't give you a way to check. only other guess is an executor bug but that seems unlikely. might have a proper look tomorrow, if I can repro I have a usb analyzer and might be able to get a debugger working.
I have an analyzer too; I don't believe there is very much interesting going on on the bus. If you are interested I'm more than happy to schedule an interactive session / voice call where we can debug this.
Very interesting: turned off integrated-timers in embassy-executor and added generic-queue in embassy-time, and the issue is no longer reproducible...
Still unsure what's going on TBH.
In embassy-executor/src/raw/mod.rs:

```rust
/// Wake a task by `TaskRef`.
///
/// You can obtain a `TaskRef` from a `Waker` using [`task_from_waker`].
pub fn wake_task(task: TaskRef) {
    let header = task.header();
    if header.state.run_enqueue() {
        // We have just marked the task as scheduled, so enqueue it.
        unsafe {
            let executor = header.executor.get().unwrap_unchecked();
            executor.enqueue(task);
        }
    }
}

/// Wake a task by `TaskRef` without calling pend.
///
/// You can obtain a `TaskRef` from a `Waker` using [`task_from_waker`].
pub fn wake_task_no_pend(task: TaskRef) {
    let header = task.header();
    if header.state.run_enqueue() {
        // We have just marked the task as scheduled, so enqueue it.
        unsafe {
            let executor = header.executor.get().unwrap_unchecked();
            executor.run_queue.enqueue(task);
        }
    }
}
```
It looks like the executor.run_queue.enqueue error is not checked. @Codetector1374 found that we have the flag set in the state but no task on the run_queue.
Edit: just kidding, the run_queue.enqueue doesn't return an error.
Hummm looks like either we have a hardware bug in the atomic instruction or there is some bug in embassy_executor
Basically TaskHeader::state is 3, which indicates we are "enqueued", but the run_queue in the executor looks empty.
```
(gdb) p *(0x20000060 as *const embassy_executor::raw::TaskHeader)
$2 = embassy_executor::raw::TaskHeader {
state: embassy_executor::raw::state::State {
state: core::sync::atomic::AtomicU32 {
v: core::cell::UnsafeCell<u32> {
value: 3
}
}
},
run_queue_item: embassy_executor::raw::run_queue::RunQueueItem {
next: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
value: core::option::Option<embassy_executor::raw::TaskRef>::None
}
}
},
executor: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<&embassy_executor::raw::SyncExecutor>> {
value: core::cell::UnsafeCell<core::option::Option<&embassy_executor::raw::SyncExecutor>> {
value: core::option::Option<&embassy_executor::raw::SyncExecutor>::Some(0x20007fec)
}
},
poll_fn: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>> {
value: core::cell::UnsafeCell<core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>> {
value: core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>::Some(0x5152 <embassy_executor::raw::TaskStorage<vapor_keeb::____embassy_main_task::{async_fn_env#0}>::poll<vapor_keeb::____embassy_main_task::{async_fn_env#0}>>)
}
},
expires_at: embassy_executor::raw::util::SyncUnsafeCell<u64> {
value: core::cell::UnsafeCell<u64> {
value: 18446744073709551615
}
},
timer_queue_item: embassy_executor::raw::timer_queue::TimerQueueItem {
next: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
value: core::option::Option<embassy_executor::raw::TaskRef>::None
}
}
}
}
(gdb) p *(0x20007fec as *const embassy_executor::raw::SyncExecutor)
$4 = embassy_executor::raw::SyncExecutor {
run_queue: embassy_executor::raw::run_queue::RunQueue {
head: core::sync::atomic::AtomicPtr<embassy_executor::raw::TaskHeader> {
p: core::cell::UnsafeCell<*mut embassy_executor::raw::TaskHeader> {
value: 0x0 <defmt::export::fmt<embassy_usb::class::hid::ReportId>>
}
}
},
pender: embassy_executor::raw::Pender (
0x0 <defmt::export::fmt<embassy_usb::class::hid::ReportId>>
),
timer_queue: embassy_executor::raw::timer_queue::TimerQueue {
head: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
value: core::option::Option<embassy_executor::raw::TaskRef>::None
}
}
},
alarm: embassy_time_driver::AlarmHandle {
id: 0
}
}
```
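To spell out why this points at the atomic instructions: the run-queue push is an atomic compare-and-swap loop on head, and on RV32 a compare-and-swap has to be built from lr.w/sc.w (the base A extension has no single CAS instruction). A simplified sketch of that kind of intrusive push, not embassy-executor's actual code:

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

/// Minimal intrusive push of the same general shape as a run-queue enqueue:
/// a CAS loop on `head`. NOT embassy-executor's actual code, just an
/// illustration of why the push turns into an lr.w / sc.w loop on RV32.
struct Node {
    next: AtomicPtr<Node>,
}

struct RunQueue {
    head: AtomicPtr<Node>,
}

impl RunQueue {
    fn enqueue(&self, node: &'static Node) {
        let node_ptr = node as *const Node as *mut Node;
        let mut prev = self.head.load(Ordering::Relaxed);
        loop {
            node.next.store(prev, Ordering::Relaxed);
            // compare_exchange_weak compiles down to an lr.w / sc.w / bnez
            // retry loop on riscv32imac
            match self
                .head
                .compare_exchange_weak(prev, node_ptr, Ordering::AcqRel, Ordering::Acquire)
            {
                Ok(_) => break,
                Err(actual) => prev = actual,
            }
        }
    }
}
```

Since the RUN_QUEUED bit in state and this push are two separate atomic operations (see wake_task above), a misbehaving LR/SC sequence could plausibly leave them inconsistent: bit set, queue empty, which is exactly what the dump shows.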
gg
Ugh.... sounds like we don't have atomics....
Thanks..... WCH
to elaborate, we switched the state/run_queue impl within embassy-executor from atomic to critical section, and the problem stopped reproducing.
It could be the atomic impl is wrong, but neither of us could come up with a flaw, and given WCH's track record...
you know… i was wondering why they removed atomics from some of the new V3 cores. might be why 🙃
(this Exists wrt testing atomics on riscv. never even touched it myself so no idea how hard it would be to run. all that really matters in practice though is “it works to spec” or “nope” so if one actually cared to prove if it doesn’t work it might just be easier to come up with a single counterexample)
you've probably got this down as a lost cause :) but it still strikes me as odd that they could mess this up and still advertise it as a feature, so i did a bit of reading about atomics on riscv and there is one smoking gun. (one caveat: I haven't seen if your issue is caused by LR/SC or AMOs, but i'm assuming LR/SC because i really can't see what could go wrong with an AMO)
exhibit a, unpriv ISA 14.2. "Zalrsc" Extension for Load-Reserved/Store-Conditional Instructions:
A store-conditional instruction to a scratch word of memory should be used to forcibly invalidate any existing load reservation: • during a preemptive context switch, and
exhibit b, priv ISA 3.3.2. Trap-Return Instructions:
If the A extension is supported, the xRET instruction is allowed to clear any outstanding LR address reservation but is not required to. Trap handlers should explicitly clear the reservation if required (e.g., by using a dummy SC) before executing the xRET.
exhibit c, https://github.com/torvalds/linux/blob/a99efa00891b66405ebd25e49868efc701fe1546/arch/riscv/kernel/entry.S#L110
Dangling load reservations can be the result of taking a trap in the middle of an LR/SC sequence, but can also be the result of a taken forward branch around an SC -- which is how we implement CAS. As a result we need to clear reservations between the last CAS and the jump back to the new context.
it's a little sneaky especially considering "preemptive context switch" isn't actually defined anywhere, but maybe ISRs should be invalidating the reservation set to avoid problems in CAS everywhere else? basically: maybe try throwing in a dummy SC before mret here? https://github.com/ch32-rs/qingke/blob/73a21d1b527327c17bd80b8722d66e10a4fcd46e/qingke-rt/macros/src/lib.rs#L238
Let me try this today, you might be correct. But I think, at least from reading the Linux code, it sounds like it is really only required if we are "saving / restoring" program state on a different "HART". Because on the same hart, with interrupts:

```
lr <variable A>
----- interrupt
lr <variable A>
// do stuff
sc <variable A>
mret
---- resume
sc <variable A>
```

the final sc <variable A> should fail, because doing LR multiple times on the same address is a NOP, and the reservation should already have been cleared by running the store-conditional inside the interrupt.
Ohhhhhhh, I had a "shower thought". I think I understand how this can be messed up:
We have the hardware save and restore enabled..... What if... what if the hardware save/restore for the interrupt context is saving and restoring the address reservation? That would allow the second store-conditional to succeed! I probably should also try that.
Actually just tried
- disabling the hardware interrupt save/restore
- adding a `sc.w` before the `mret`

neither fixed the problem, unfortunately.
Looks like we will just have to disable atomics for now.
worth a try. i should really just find the time to repro this myself 😅. i think you are correct that a “real” preemptive scheduler would be needed to cause problems (the dangling lr) here, not just interrupts. i don’t even think nested interrupts (which i believe are enabled, and used by time-driver-systick) could cause this one. thanks for your efforts 👍
Oh, I repro this just by mashing a key on my “hid example”. Check vapor-keeb/vapor-keeb’s main.
and was using tick timer 1
Discovered by accident lol. The repro process is kinda funny.