dwc_otg driver causing complete system freeze in stable 6.6.28 kernel (Home Assistant OS, RPi OS)
Describe the bug
With the upgrade of Home Assistant OS to latest stable 6.6 kernel, we started to get reports of boot loops when some USB devices are connected: https://github.com/home-assistant/operating-system/issues/3362
Further investigation showed it's caused by the default dwc_otg driver, which causes a complete system freeze, with the watchdog restarting the device shortly after. I managed to reproduce the same issue on RPi OS (both 32-bit and 64-bit) using the steps described below, with kernel 6.6.20 from the current OS image and the latest 6.6.28 from the APT repo. It's still not completely clear to me whether it's only reproducible with FIQ enabled, because in my testing it seemed stable without it; however, changing to dwc2 seems to reliably resolve the issue.
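For anyone who wants to try the dwc2 workaround, here is a minimal sketch. It assumes the stock `dwc2` overlay with `dr_mode=host` and a config file at /boot/firmware/config.txt (older images use /boot/config.txt, and on HAOS the file sits on the boot partition). The helper name `uses_dwc2` is made up for illustration.

```shell
# Hypothetical helper: succeeds if the given config.txt already selects
# the dwc2 overlay on an uncommented line.
uses_dwc2() {
    grep -Eq '^[[:space:]]*dtoverlay=dwc2([,[:space:]]|$)' "$1"
}

# Assumed default path on current RPi OS images; adjust for your system.
CONFIG="${CONFIG:-/boot/firmware/config.txt}"

# Usage (run manually as root, then reboot):
#   uses_dwc2 "$CONFIG" || echo 'dtoverlay=dwc2,dr_mode=host' >> "$CONFIG"
```

The check only looks at the config file, so it says nothing about which driver is actually bound after boot.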
There are some reports that other USB devices (Zigbee sticks) also trigger the same issue. RPi 3B seems to be the most common, but there's anecdotal evidence of it happening on RPi 4B as well. We also have reports of degraded performance of Zigbee sticks on RPi 4 and 5 (not leading to a freeze/boot loop), but it's not yet clear whether this is related: https://github.com/home-assistant/operating-system/issues/3352
I'll be happy to perform any further tests or ask other people for more details to get this one sorted out.
Steps to reproduce the behaviour
- Install Home Assistant OS 12.3 (based on stable downstream RPi kernel 6.6.28).
- Plug in Z-Wave.me UZB stick.
- Set up Z-Wave / start the Z-Wave JS add-on which initiates communication with the USB ACM device
- System immediately freezes.
Alternatively, on RPi OS:
- Install Docker.
- Plug in Z-Wave.me UZB stick.
- Start the Z-Wave JS UI container:
docker run --rm -it -p 8091:8091 -p 3000:3000 --device=/dev/serial/by-id/usb-0658_0200-if00:/dev/zwave --mount source=zwave-js-ui,target=/usr/src/app/store zwavejs/zwave-js-ui:latest
- Fill in any Z-Wave keys in the web UI and save the config.
- System immediately freezes.
Device(s)
Raspberry Pi 3 Mod. B
System
pi@rpios:~ $ cat /etc/rpi-issue
Raspberry Pi reference 2024-03-15
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 11096428148f0f2be3985ef3126ee71f99c7f1c2, stage2
pi@rpios:~ $ vcgencmd version
Apr 17 2024 17:29:03
Copyright (c) 2012 Broadcom
version 86ccc427f35fdc604edc511881cdf579df945fb4 (clean) (release) (start)
pi@rpios:~ $ uname -a
Linux rpios 6.6.28+rpt-rpi-v7 #1 SMP Raspbian 1:6.6.28-1+rpt1 (2024-04-22) armv7l GNU/Linux
Logs
[ 121.819373] WARN::dwc_otg_hcd_urb_dequeue:638: Timed out waiting for FSM NP transfer to complete on 3
[ 142.879733] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 142.885902] rcu: 3-...0: (1 GPs behind) idle=5b8c/1/0x4000000000000000 softirq=37511/37513 fqs=5089
[ 142.895112] rcu: (detected by 2, t=21012 jiffies, g=69409, q=393 ncpus=4)
[ 142.903141] [ 156.130975] mmc1: Timeout waiting for hardware interrupt.
Task dump for CPU 3:
[ 142.903147] task:node state:R running task stack:0 pid:2956 ppid:2883 flags:0x00000202
[ 142.903166] Call trace:
[ 142.903172] __switch_to+0xe8/0x168
[ 142.903192] 0x0
[ 156.130975] mmc1: Timeout waiting for hardware interrupt.
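When collecting logs from other affected systems, it may help to filter for the three signatures seen above. This is only a sketch: the function name `stall_lines` is hypothetical, and the pattern covers just the messages that appear in this log.

```shell
# Hypothetical filter: keep only kernel log lines matching the signatures
# seen in this report (dwc_otg dequeue timeout, RCU stall, mmc1 timeout).
stall_lines() {
    grep -E 'dwc_otg_hcd_urb_dequeue|rcu_preempt detected stalls|mmc1: Timeout waiting for hardware interrupt'
}

# Usage on a live system (a serial or HDMI console is more useful once the
# machine freezes, since dmesg can no longer be read at that point):
#   dmesg --follow | stall_lines
```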
Additional context
Might be closely related to #6100 but unlike there, even latest kernel from the 6.6.y branch (6.6.30) did not fix the issue.
Is this a regression? i.e. has this ever been reliable with an older kernel?
You can install historical kernels using rpi-update <hash> to confirm.
It is definitely a regression on Home Assistant OS: it is resolved by reverting (HAOS uses an A/B boot mechanism) to a build using kernel tag stable_20240124 (6.1.73), and it is reproducible with stable_20240423 (6.6.28). I am not aware of any similar issues in the past, and there are no relevant changes in the HAOS tree between those two builds that could be the cause.
I'll test an older RPi OS kernel and report back shortly.
I downgraded 32-bit RPi OS to 6.1 from the stable branch (6.1.73) using rpi-update 6c2b033bf556c2a2ae109ec85d86485fa4c16050 and I can confirm I cannot reproduce it there either. So I think we can safely call it a 6.6 regression.
rpi-update 5fc4f643d2e9c5aa972828705a902d184527ae3f should get you the most recent 6.1 kernel (6.1.77).
rpi-update 7fa525a8a7d42235a8eaa52f5e3636ede9073225 should get you the oldest 6.6 kernel (6.6.5).
If the first works and the second fails, then it's likely the switch to the 6.6 tree. If not, then it's one of the commits on 6.1 or 6.6 and we may be able to narrow it down further.
- 6.1.77 does not manifest the issue.
- 6.6.5 doesn't boot at all (double-checked on 32bit and 64bit OS):
Possibly the boot failure is due to 4a8f7f7661252072494ac16d3edc035193c6ea04
Maybe rpi-update 07ff8bbae5c5e6a52c61ca062fdb181fd80202bc is the first build (6.6.20) with that fix.
Moved a bit forward in the Git history and re-tested with hash 7c8a2bd9d4cc862929eb49d0c3cef2ffc59a365d (6.6.8), issue is present, last message on HDMI console before the system froze:
(FWIW USB enumeration errors are another known issue of this particular USB device: https://github.com/home-assistant/operating-system/issues/2995)
Yes, it looks like rpi-update 7c8a2bd9d4cc862929eb49d0c3cef2ffc59a365d (6.6.8) is the first build with the linked commit (and it is a very early build on the 6.6 tree).
So it seems the issue started with the move to the 6.6 tree (which doesn't narrow it down much).
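To keep track of the builds tested so far, the results reported in this thread can be collected in a small lookup. This is a sketch; the function names and the result labels (`ok`, `no-boot`, `freeze`) are made up for illustration, while the versions and hashes are the ones quoted above.

```shell
# Builds tested in this thread: version, rpi-update hash, observed result.
tested_builds() {
    cat <<'EOF'
6.1.73 6c2b033bf556c2a2ae109ec85d86485fa4c16050 ok
6.1.77 5fc4f643d2e9c5aa972828705a902d184527ae3f ok
6.6.5 7fa525a8a7d42235a8eaa52f5e3636ede9073225 no-boot
6.6.8 7c8a2bd9d4cc862929eb49d0c3cef2ffc59a365d freeze
EOF
}

# Hypothetical lookup: print the rpi-update hash for a kernel version,
# or nothing if that version was not tested here.
hash_for() {
    tested_builds | awk -v v="$1" '$1 == v { print $2 }'
}

# Usage (installs that kernel; reboot afterwards):
#   sudo rpi-update "$(hash_for 6.1.77)"
```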
Where can I download an older version that works, outside of the Raspberry Pi Imager, and flash it to the device?
An older version of RPiOS? There's a lot of historical versions here: https://downloads.raspberrypi.com/
Yes. I have been having lots of other Z-Wave issues, so I kept installing updates in an attempt to fix them, and now I think I am stuck because of this issue. I was trying to restore a backup by flashing the device from the Imager (version from 5/8/2024). With the USB devices unplugged, the system starts. Is there a console command to roll back without having to image an older version?
You can revert bootloader/firmware/kernel with rpi-update. There is no way to revert all of apt.
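A minimal sketch of such a rollback follows. The guard function `is_commit_hash` is hypothetical and deliberately strict: it accepts only a full 40-character lowercase hex hash of the kind quoted in this thread.

```shell
# Hypothetical guard: accept only a full 40-character hex commit hash
# before handing it to rpi-update.
is_commit_hash() {
    printf '%s' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# Usage (the example hash is the 6.1.77 build named earlier in this thread;
# reboot after rpi-update finishes):
#   h=5fc4f643d2e9c5aa972828705a902d184527ae3f
#   is_commit_hash "$h" && sudo rpi-update "$h"
```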
Reflashed 12.2 to the SD card. I will restore a backup from before the 12.3 upgrade and then wait for a fix.
Update: I am back up and running on 12.2.
If I remember correctly, at least one of the Z-Wave dongles is/was seriously non-USB standards-compliant. @P33M?
Ah yes, the stick I have the issue with is mentioned in https://forums.raspberrypi.com/viewtopic.php?f=28&t=245031#p1502030 and #3027.
However, this was causing problems with Pi4 and not the Pi3, on which the current problem presents. It would be interesting to see if adding a hub in between solves the issue or not.
The USB stick from sairon is a different one though.
The latest kernel does not fix the issue for me. The dwc2 driver fixes it for the Z-Wave stick but breaks something else. I am still on the 6.1.77 kernel.
It was an Aeotec dongle and the symptom there was "failure to enumerate" not a hang during use. Nothing substantial changed in dwc_otg between 6.1 and 6.6 - the fact that mmc dies as well as USB points to some fundamental breakage.
Wondering what's the current status on this (obviously kind of major) issue after a silence of 2 months?
For what it's worth, I have a similar problem with a USB Ethernet adapter.
Raspberry Pi 4, Bookworm Desktop (64-bit), with:
- 1 RJ45 -> LAN
- 1 USB -> Analog Devices Pluto SDR
- 1 USB -> ASIX USB/Ethernet dongle -> RJ45 on a DDMAL HDMI video encoder
The Pluto always connects, but the Dongle does not.
pi@txtouch:~ $ uname -a
Linux txtouch 6.6.31+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux
pi@txtouch:~ $ lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 004: ID 0b95:7720 ASIX Electronics Corp. AX88772
Bus 001 Device 003: ID 0456:b673 Analog Devices, Inc. LibIIO based AD9363 Software Defined Radio [ADALM-PLUTO]
Bus 001 Device 002: ID 2109:3431 VIA Labs, Inc. Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
pi@txtouch:~ $ nmcli device
DEVICE TYPE STATE CONNECTION
eth0 ethernet connected Wired connection 1
eth1 ethernet connected Wired connection 2
lo loopback connected (externally) lo
eth2 ethernet connecting (getting IP configuration) Wired connection 3
wlan0 wifi disconnected --
p2p-dev-wlan0 wifi-p2p disconnected --
pi@txtouch:~ $ nmcli monitor
NetworkManager is running
eth2: connection failed
Networkmanager is now in the 'connected (site only)' state
eth2: disconnected
eth2: using connection 'Wired connection 3'
eth2: connecting (prepare)
Networkmanager is now in the 'connecting' state
eth2: connecting (configuring)
eth2: connecting (getting IP configuration)
eth2: connection failed
Networkmanager is now in the 'connected (site only)' state
eth2: disconnected
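To spot an interface stuck in this state without watching the monitor output, something like the following sketch could be used. It assumes nmcli's terse output (`-t -f DEVICE,STATE`), which prints one `DEVICE:STATE` pair per line; the function name `stuck_connecting` is made up.

```shell
# Hypothetical filter: read "DEVICE:STATE" lines (nmcli terse format) on
# stdin and print the devices whose state starts with "connecting".
stuck_connecting() {
    awk -F: '$2 ~ /^connecting/ { print $1 }'
}

# Usage:
#   nmcli -t -f DEVICE,STATE device | stuck_connecting
```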
I saw that 13.0 released today. Any idea if this issue has been resolved?
Since I updated to HA OS 13 I have this problem. Before the update, my setup was running fine with a Zigbee stick and an RPi 3B+, but now it's broken 😢 and restarting all the time. After deactivating the "Sonoff Zigbee 3.0 USB Dongle Plus" integration it runs, but without Zigbee sensors.
Me also!
https://www.reddit.com/r/homeassistant/comments/1est4zd/update_often_crashes_everything/
What version did you run before updating to HA OS 13.0?
12.3 or 12.4, I can't remember which one exactly. Anyway, I fixed it by doing a fresh install of HAOS 13.0. Now it's running like before, even with 13.0.
Hello, I have the Sonoff Dongle-E Plus with Home Assistant OS on an RPi 3B+. On my side the problem occurred only when the zigbee2mqtt add-on was starting: the host restarted each time. It first appeared when I restarted the host yesterday, while it was still on 12.2; the previous restart was months ago. I took the opportunity to upgrade the dongle firmware, and when I restarted everything and plugged the dongle back in there were no more problems, so I thought the firmware update had resolved it.
I decided to jump to 13.1 to see how it goes: the host restarted and the same problem appeared again, a restart loop only when the z2m add-on starts.
After some testing, this is how I worked around it. I have to follow these steps:
- stop z2m add-on and don't let it start at HA start
- unplug the dongle
- restart host
- wait some minutes after restart
- plug the dongle
- start z2m
My conclusion for now is that the host must not restart while the dongle is plugged in. I suppose that for the next host update I will have to repeat these steps before updating and plug the dongle in only at the end.
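The last two steps of that workaround (plug the dongle in, then start z2m) could be scripted roughly as below. This is a sketch, not a tested procedure: `wait_for_path` is a hypothetical helper, and `Z2M_SLUG` and `DONGLE_PATH` are placeholders for the add-on slug and the dongle's device node, neither of which is given in this thread.

```shell
# Hypothetical helper: poll until a path exists or the timeout (in seconds)
# expires. Returns 0 if the path appeared, 1 otherwise.
wait_for_path() {
    path=$1
    remaining=$2
    while [ ! -e "$path" ]; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Usage on HAOS (slug and device path are placeholders):
#   ha addons stop "$Z2M_SLUG"
#   ha host reboot
#   ...after the reboot, wait a few minutes, plug the dongle in, then:
#   wait_for_path "$DONGLE_PATH" 300 && ha addons start "$Z2M_SLUG"
```

Waiting on the device node rather than a fixed delay means the add-on is started only once the dongle has actually been enumerated.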