CM5 without wifi hangs on reboot

Open nbuchwitz opened this issue 11 months ago • 41 comments

Describe the bug

We stumbled over an issue where all CM5 modules without wifi seem to hang when rebooted; the reboot only completes after some waiting. CM5 modules with wifi show no such error (same base boards, same software). As the reboot did work on a CM5 without wifi in some rare cases, I started to debug it further.

When reboot hangs:

Dec 01 13:28:51 RevPi systemd[1]: Shutting down.
Dec 01 13:28:51 RevPi systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Dec 01 13:28:51 RevPi systemd[1]: Watchdog running with a hardware timeout of 10min.
Dec 01 13:28:51 RevPi kernel: watchdog: watchdog0: watchdog did not stop!
Dec 01 13:28:51 RevPi systemd-shutdown[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Dec 01 13:28:52 RevPi systemd-shutdown[1]: Watchdog running with a hardware timeout of 10min.
Dec 01 13:28:52 RevPi systemd-shutdown[1]: Syncing filesystems and block devices.
Dec 01 13:28:52 RevPi systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Dec 01 13:28:52 RevPi systemd-journald[167]: Received SIGTERM from PID 1 (systemd-shutdow).
Dec 01 13:28:52 RevPi systemd-journald[167]: Journal stopped

When reboot works immediately:

Dec 01 13:29:57 RevPi136828 systemd[1]: Shutting down.
Dec 01 13:29:58 RevPi136828 systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Dec 01 13:29:58 RevPi136828 systemd[1]: Watchdog running with a hardware timeout of 10min.
Dec 01 13:29:58 RevPi136828 kernel: mmc1: Failed to initialize a non-removable card
Dec 01 13:29:58 RevPi136828 kernel: watchdog: watchdog0: watchdog did not stop!
Dec 01 13:29:58 RevPi136828 systemd-shutdown[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Dec 01 13:29:58 RevPi136828 systemd-shutdown[1]: Watchdog running with a hardware timeout of 10min.
Dec 01 13:29:58 RevPi136828 systemd-shutdown[1]: Syncing filesystems and block devices.
Dec 01 13:29:58 RevPi136828 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Dec 01 13:29:58 RevPi136828 systemd-journald[174]: Received SIGTERM from PID 1 (systemd-shutdow).
Dec 01 13:29:58 RevPi136828 systemd-journald[174]: Journal stopped

The culprit seems to be (always present when the reboot works):

Dec 01 13:29:58 RevPi136828 kernel: mmc1: Failed to initialize a non-removable card
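
To check other boots for the same message, the kernel log can be filtered for it. This is only a minimal sketch: the helper name `mmc1_gave_up` is hypothetical, and the example in the comment assumes a persistent journal so that the previous boot (`-b -1`) is still available.

```shell
#!/bin/sh
# Reads kernel log lines on stdin and succeeds if the mmc1 probe
# reported that it gave up (the message quoted above).
mmc1_gave_up() {
    grep -q 'mmc1: Failed to initialize a non-removable card'
}

# Example (assumes a persistent journal):
#   journalctl -k -b -1 --no-pager | mmc1_gave_up \
#       && echo "message present" || echo "message not logged"
```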

So it looks like there might be an issue with the sdio / mmc1 interface, which is unused on the wifi-less variant of the CM5. To verify my suspicion I've created a simple overlay which disables sdio2 completely:

[...]
       fragment@13 {
               target = <&sdio2>;
               __overlay__ {
                       status = "disabled";
               };
       };

With this the reboot works reliably in all tests so far. Even though it kinda works with a custom overlay, it looks wrong. It is also not a reliable solution for production, as during first boot only the cm5io dt loaded by the firmware is present and a subsequent reboot will fail very often.

Same works on CM4 with / without wifi (different overlay though, but should be irrelevant as it also happens with pure CM dt).

Any ideas / insights on this?

Steps to reproduce the behaviour

  1. Boot device with CM5 without wifi module
  2. sudo reboot

Device (s)

Raspberry Pi CM5

System

2024/09/23 14:02:56 
Copyright (c) 2012 Broadcom
version 26826259 (release) (embedded)

EEPROM release: 1727096576

Kernel: 6.6.74+rpt-rpi-v8

Logs

No response

Additional context

No response

nbuchwitz avatar Feb 04 '25 10:02 nbuchwitz

I did some further research and noticed that /sys/kernel/debug/mmc1/ios differs in good and bad cases:

pi@RevPi136828:~/debug$ diff --side-by-side working/mmc1_ios notworking/mmc1_ios 
clock:		0 Hz					      |	clock:		100000 Hz
vdd:		0 (invalid)				      |	actual clock:	100000 Hz
							      >	vdd:		21 (3.3 ~ 3.4 V)
bus mode:	2 (push-pull)					bus mode:	2 (push-pull)
chip select:	0 (don't care)					chip select:	0 (don't care)
power mode:	0 (off)					      |	power mode:	2 (on)
bus width:	0 (1 bits)					bus width:	0 (1 bits)
timing spec:	0 (legacy)					timing spec:	0 (legacy)
signal voltage:	0 (3.30 V)					signal voltage:	0 (3.30 V)
driver type:	0 (driver type B)				driver type:	0 (driver type B)

What could be the reason that power mode is set to on in the non-working (=hangs during reboot) case?

It also seems that if the power mode is set to on it is reset to off after approx. 53 seconds (see attached debug log, first line is date, then uptime in seconds and then mmc1_ios content)

debug.txt
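
The attached log is not reproduced here. A polling loop along the following lines would produce records in the shape described (date, uptime in seconds, then the ios content). It is a sketch only: the helper name `snapshot_ios` and the one-second interval are assumptions, not necessarily what was used for the attachment.

```shell
#!/bin/sh
# Print one record: date, uptime in seconds, then the content of the
# given ios file. The path is an argument so any mmc host can be polled.
snapshot_ios() {
    ios_file=$1
    date
    cut -d' ' -f1 /proc/uptime
    cat "$ios_file"
}

# Poll once per second until interrupted (reading debugfs needs root):
#   while :; do snapshot_ios /sys/kernel/debug/mmc1/ios; echo; sleep 1; done
```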

After I performed a firmware update to 1737505011, the time until the power mode is switched to off increased to ~83 seconds (~ +30 seconds, 1737983339 is about 10 seconds less).

debug-fw1737505011.txt

A downgrade to 1731427844 showed the same behavior as with 1727096576 (initial firmware on this compute module): power_mode is set to off after approx 53 seconds:

debug-fw1731427844.txt

Handover to the OS takes about 8-9 seconds, so I don't think the difference is caused by something like this.

So it seems to me that this might be a firmware related issue or at least it has some influence.

I also did some testing on a CM4 without wifi, and there /sys/kernel/debug/mmc1/ios shows that the interface is disabled correctly upon boot.

nbuchwitz avatar Feb 06 '25 13:02 nbuchwitz

Hi Nicolai, we'll look into disabling SDIO2 from the firmware for non-WiFi-enabled parts.

pelwell avatar Feb 10 '25 10:02 pelwell

Thanks Phil for the update

nbuchwitz avatar Feb 10 '25 10:02 nbuchwitz

pieeprom_cm5nowifi.zip Here's a trial build with a theoretical fix - it should disable sdio2 on a CM5 with no WiFi. I've tried it on a Pi 5 to confirm that it isn't completely broken, but I don't have a suitable CM5 to hand - the next task is to locate one.

pelwell avatar Feb 10 '25 12:02 pelwell

Give me some minutes and I will test it, I have modules at hand ...

nbuchwitz avatar Feb 10 '25 13:02 nbuchwitz

I can confirm, mmc1 is gone with the test firmware:

pi@RevPi136828:~$ ls -d /sys/kernel/debug/mmc?
/sys/kernel/debug/mmc0
pi@RevPi136828:~$ rpi-eeprom-update 
BOOTLOADER: up to date
   CURRENT: Mon Feb 10 12:04:08 PM UTC 2025 (1739189048)
    LATEST: Wed Jan 22 12:16:51 AM UTC 2025 (1737505011)
   RELEASE: default (/usr/lib/firmware/raspberrypi/bootloader-2712/default)
            Use raspi-config to change the release.

Reboot is also working without hang / delay.

nbuchwitz avatar Feb 10 '25 13:02 nbuchwitz

Great. We'll get that merged, then into a release at some point.

pelwell avatar Feb 10 '25 13:02 pelwell

Thanks. In the meantime I will do some thinking and come up with some tooling for our end of line tests, so we can update the modules in place.

nbuchwitz avatar Feb 10 '25 13:02 nbuchwitz

Just a note for others who might need to work around the issue that the first reboot after the firmware update still hangs (which is fine, as we're still running the old firmware at that point):

# set power to permanently on in order to avoid timeout of probe cycles
echo on | sudo tee /sys/class/mmc_host/mmc1/device/power/control

# unbind driver on mmc1
basename $(realpath /sys/class/mmc_host/mmc1/../..) | sudo tee /sys/bus/platform/drivers/sdhci-brcmstb/unbind

nbuchwitz avatar Feb 11 '25 09:02 nbuchwitz

It's odd that a non-WiFi CM5 is rebooting without issue for me. I've tried rebooting before the mmc1: Failed to initialize a non-removable card error message (which I don't always see), and I've tried afterwards. This is with the stock firmware 2024/09/23, and with the latest release (Wed 22 Jan 00:16:51 UTC 2025 (1737505011)). The worst I see is a stall of up to 40 seconds until the mmc driver gives up (mmc1: Failed to initialize a non-removable card).

The power mode difference is just an indicator of whether or not the kernel has given up on there being something on that SDIO bus - it turns off the power when it loses hope.

pelwell avatar Feb 11 '25 12:02 pelwell

Yes, at some point the device does reboot (after the driver gives up on mmc1). The issue (at least for us) is that this causes timeouts during the end of line test, as the system expects the DUT to reboot within a reasonable period. On CM5 this extra delay after reboot is (depending on how fast the provisioning of the HAT eeprom was) up to 60 seconds, which will cause a timeout. Also noteworthy: on CM4 non-wifi variants this works without additional delay.

nbuchwitz avatar Feb 11 '25 13:02 nbuchwitz

The patch to disable sdio2 has been merged, so future EEPROM builds will include it. I do wonder though if the kernel retry mechanism can be adjusted to not take quite so long.

pelwell avatar Feb 11 '25 14:02 pelwell

I do wonder though if the kernel retry mechanism can be adjusted to not take quite so long.

That was also what I was initially thinking when I raised this issue. I haven't had the time to dig deeper into what the differences between bcm2711 and 2712 are here, but from a first look they share at least the same driver for mmc1.

nbuchwitz avatar Feb 11 '25 15:02 nbuchwitz

The rescan code tries 3 different card types at 4 different clock frequencies. All of those tests involve timeouts of specific durations, so they shouldn't simply be shortened. The other approach would be to make the scanning interruptible at some granularity - at least between frequencies. There may be a way to mark that the interface is being shut down - perhaps using the rescan_disable flag - but it's not something I'd want to do hastily.
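
To put a number on that, 3 card types at 4 frequencies means up to 12 probe attempts per rescan. The sketch below only enumerates them. The frequency values and the SDIO, SD, MMC order are assumptions based on the mainline mmc core (drivers/mmc/core/core.c), not something measured in this thread; the helper name is hypothetical.

```shell
#!/bin/sh
# Enumerate the probe attempts of one rescan: each frequency is tried
# with each card type in turn. The values are assumptions (see above).
list_probe_attempts() {
    for freq in 400000 300000 200000 100000; do
        for card in sdio sd mmc; do
            echo "$freq Hz: $card"
        done
    done
}

# Count the attempts of a single rescan.
list_probe_attempts | wc -l
```

If the frequency list is right, the final step would be 100000 Hz, which would be consistent with the `clock: 100000 Hz` shown in the non-working ios dump earlier.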

pelwell avatar Feb 11 '25 19:02 pelwell

same issue with Pi5

Feb 24 19:23:12 RaspberryPi5 systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Feb 24 19:23:12 RaspberryPi5 systemd[1]: Watchdog running with a hardware timeout of 10min.
Feb 24 19:23:12 RaspberryPi5 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 24 19:23:12 RaspberryPi5 systemd-shutdown[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Feb 24 19:23:12 RaspberryPi5 systemd-shutdown[1]: Watchdog running with a hardware timeout of 10min.
Feb 24 19:23:12 RaspberryPi5 systemd-shutdown[1]: Syncing filesystems and block devices.
Feb 24 19:23:12 RaspberryPi5 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Feb 24 19:23:12 RaspberryPi5 systemd-journald[292]: Received SIGTERM from PID 1 (systemd-shutdow).
Feb 24 19:23:12 RaspberryPi5 systemd-journald[292]: Journal stopped

neonblind avatar Feb 25 '25 15:02 neonblind

This issue smells very similar to this: https://forums.raspberrypi.com/viewtopic.php?t=288866

Just a note for others which might need to work around the issue that the first reboot after firmware update still hangs (which is fine as we're still running the old firmware):

# set power to permanently on in order to avoid timeout of probe cycles
echo on | sudo tee /sys/class/mmc_host/mmc1/device/power/control

# unbind driver on mmc1
basename $(realpath /sys/class/mmc_host/mmc1/../..) | sudo tee /sys/bus/platform/drivers/sdhci-brcmstb/unbind

After running these two commands (with mmc0), I am able to shutdown my CM5Lite, booted off NVMe, no SD card inserted, with no hang. Though, the unbind takes ~24s intermittently (sometimes <100ms, sometimes 20-50s) which is not ideal.

My current workaround is to just disable the interface entirely with a dtoverlay...but it would be nice to be able to still have an SD card work.

/dts-v1/;
/plugin/;

/ {
    compatible = "brcm,bcm2712";

    fragment@0 {
        target = <&sdio1>;
        __overlay__ {
            status = "disabled";
        };
    };
};

Muny avatar Mar 17 '25 23:03 Muny

There is already an overlay for this (it's called disable-wifi or wlan, I think). But this shouldn't be necessary with the firmware update. Did you already update the eeprom on your CM5? If not: sudo rpi-eeprom-update -a

nbuchwitz avatar Mar 18 '25 06:03 nbuchwitz

I'm still having this problem with the Raspberry Pi Compute Module 5 with Linux 6.6.51, 6.6.74 and 6.12.19 (from rpi-update). I also have the latest eeprom via sudo rpi-eeprom-update -a. It works once after updating the Linux version, but after rebooting once it goes back to the same issue, where it is stuck on watchdog0 or systemd halt for both reboot and halt. This is from a fresh install of Raspberry Pi OS Lite 64-bit from Raspberry Pi Imager. Can anybody help me with this issue?

hasan-akbulak avatar Mar 25 '25 10:03 hasan-akbulak

Report output of vcgencmd bootloader_version

popcornmix avatar Mar 25 '25 13:03 popcornmix

root@raspberrypi:~# vcgencmd bootloader_version
2025/03/19 13:41:26
version cec1d3ae40f4a1cb24fe3c42d60153968695385b (release)
timestamp 1742391686
update-time 1742896386
capabilities 0x0000007f

hasan-akbulak avatar Mar 25 '25 14:03 hasan-akbulak

Okay, that should contain the fix referenced here.
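
One way to make that comparison on a device is to pull the timestamp field out of the `vcgencmd bootloader_version` output and compare it with the trial build posted earlier in this thread (1739189048). This is a sketch under a stated assumption: that every release with a later timestamp includes the merged sdio2 fix. The helper name `bootloader_has_fix` is hypothetical.

```shell
#!/bin/sh
# Timestamp of the trial bootloader build posted earlier in this thread.
# Assumption: any release with a later timestamp includes the merged fix.
FIX_TS=1739189048

# Reads `vcgencmd bootloader_version` output on stdin and succeeds if
# the "timestamp" field is at or after FIX_TS.
bootloader_has_fix() {
    ts=$(awk '$1 == "timestamp" { print $2; exit }')
    [ -n "$ts" ] && [ "$ts" -ge "$FIX_TS" ]
}

# Example:
#   vcgencmd bootloader_version | bootloader_has_fix \
#       && echo "fix expected" || echo "older bootloader"
```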

popcornmix avatar Mar 26 '25 14:03 popcornmix

Whenever I try to shut down or reboot the Pi, it still has the same issue of stalling for between 20 and 50 seconds, and dmesg still reports mmc errors even when the SD card is disabled in config.txt. I don't know how to fix this issue. I have also tried multiple IO boards, with the same result.

hasan-akbulak avatar Mar 26 '25 14:03 hasan-akbulak

What does sudo vclog -m report?

pelwell avatar Mar 26 '25 14:03 pelwell

tc@raspberrypi:~ $ sudo vclog -m
005414.426: Initial voltage 800000 temp 42226
005614.834: avs_2712: AVS pred 8945 894500 temp 42226
005618.442: vpred 894 mV +0
005632.134: FB framebuffer_swap 1
005651.534: Select resolution HDMI0/2 hotplug 1 max_mode 2
005667.959: HDMI0 edid block 0 offset 0
005670.339: 00ffffffffffff00410c55c17e7d0000
005676.011: 2a1e010380351e782a0565a756529c27
005681.684: 0f5054bfef00d1c0b300950081808140
005687.357: 81c001010101023a801871382d40582c
005693.030: 45000f282100001e2a4480a070382740
005698.703: 302035000f282100001a000000fc0050
005704.376: 484c2032343356370a202020000000fd
005710.049: 00324c1e5311000a2020202020200115
005728.097: HDMI0 edid block 1 offset 128
005730.654: 02031ef14b101f051404130312021101
005736.327: 230907078301000065030c0010008c0a
005742.000: d08a20e02d10103e96000f2821000018
005747.673: 011d007251d01e206e2855000f282100
005753.346: 001e8c0ad08a20e02d10103e96000f28
005759.018: 210000188c0ad090204031200c405500
005764.691: 0f282100001800000000000000000000
005770.364: 000000000000000000000000000000cd
005776.055: HDMI0: best-mode 2 (limit 2) 1920x1080 60 Hz CEA modes 3e001f80000000000000000000000000 extensions 1
005787.649: Select resolution HDMI1/2 hotplug 0 max_mode 2
005794.571: FB0 disp 0 max-fb 2 1920x1080 stride 3840 base 0x3f800000
006127.100: dtb_file 'bcm2712-rpi-cm5l-cm5io.dtb'
006204.752: Loaded overlay 'bcm2712d0'
006301.854: dtparam: i2c_arm=on
006318.480: dtparam: audio=on
006324.419: Unknown dtparam 'audio' - ignored
006353.119: Loaded overlay 'audioinjector-isolated-soundcard'
006459.728: Loaded overlay 'vc4-kms-v3d-pi5'
006570.091: Loaded overlay 'dwc2'
006571.952: dtparam: dr_mode=peripheral
006577.367: dtparam: pciex1_gen=3
006591.870: dtparam: uart0_console=true
006645.601: Loaded overlay 'disable-bt-pi5'
006666.529: Loaded overlay 'disable-wifi-pi5'
006669.455: dtparam: i2c_vc=on
006685.972: dtparam: i2c_arm=on
006759.391: Loaded overlay 'vc4-kms-v3d-pi5'
006906.559: Loaded overlay 'vc4-kms-dsi-waveshare-panel'
006910.434: dtparam: 7_0_inchC=true
006916.606: dtparam: i2c1=true
006920.790: dtparam: sd_poll_once=true
006929.620: Unknown dtparam 'sd_poll_once' - ignored
006933.151: dtparam: fan_temp0=40000
006943.016: dtparam: fan_temp0_hyst=5000
006950.342: dtparam: fan_temp0_speed=70
006971.193: dtparam: fan_temp1=50000
006978.208: dtparam: fan_temp1_hyst=5000
006985.578: dtparam: fan_temp1_speed=120
007006.353: dtparam: fan_temp2=60000
007013.405: dtparam: fan_temp2_hyst=5000
007020.806: dtparam: fan_temp2_speed=150
007041.643: dtparam: fan_temp3=75000
007048.726: dtparam: fan_temp3_hyst=5000
007056.213: dtparam: fan_temp3_speed=255
007077.000: dtparam: sd=off
007083.077: Unknown dtparam 'sd' - ignored
007442.377: RPM 9052, max RPM 9052
009190.107: Starting OS 9190 ms
009195.631: 00000040: -> 00000480
009197.484: 00000030: -> 00100080
009202.196: 00000034: -> 00100080
009206.909: 00000038: -> 00100080
009211.622: 0000003c: -> 00100080
009321.194: sdram: sdram refresh 2081->4162 (2)
069314.739: initial_turbo of 60 deactivated

hasan-akbulak avatar Mar 26 '25 14:03 hasan-akbulak

Thanks.

006645.601: Loaded overlay 'disable-bt-pi5'
006666.529: Loaded overlay 'disable-wifi-pi5'

These lines show that the firmware has detected your no-WiFi CM5 and disabled Bluetooth and WiFi (or at least attempted to).

The rest shows that you have several other overlays and parameters in there. Please remove them (or comment them out) for testing purposes.

pelwell avatar Mar 26 '25 14:03 pelwell

I am sorry, these parameters were from a non-fresh install. Let me do a fresh install to remove any extra variables.

Everything is at default settings now; I only did a sudo apt update and upgrade.

When rebooting the issue persists. Here is my sudo vclog -m:

005410.141: Initial voltage 800000 temp 43875
005610.558: avs_2712: AVS pred 8945 894500 temp 44424
005614.166: vpred 894 mV +0
005627.756: FB framebuffer_swap 1
005647.141: Select resolution HDMI0/2 hotplug 1 max_mode 2
005663.564: HDMI0 edid block 0 offset 0
005665.944: 00ffffffffffff00410c55c17e7d0000
005671.617: 2a1e010380351e782a0565a756529c27
005677.290: 0f5054bfef00d1c0b300950081808140
005682.962: 81c001010101023a801871382d40582c
005688.635: 45000f282100001e2a4480a070382740
005694.308: 302035000f282100001a000000fc0050
005699.981: 484c2032343356370a202020000000fd
005705.654: 00324c1e5311000a2020202020200115
005723.702: HDMI0 edid block 1 offset 128
005726.260: 02031ef14b101f051404130312021101
005731.932: 230907078301000065030c0010008c0a
005737.605: d08a20e02d10103e96000f2821000018
005743.278: 011d007251d01e206e2855000f282100
005748.951: 001e8c0ad08a20e02d10103e96000f28
005754.624: 210000188c0ad090204031200c405500
005760.297: 0f282100001800000000000000000000
005765.969: 000000000000000000000000000000cd
005771.660: HDMI0: best-mode 2 (limit 2) 1920x1080 60 Hz CEA modes 3e001f80000000000000000000000000 extensions 1
005783.255: Select resolution HDMI1/2 hotplug 0 max_mode 2
005790.175: FB0 disp 0 max-fb 2 1920x1080 stride 3840 base 0x3f800000
006495.363: dtb_file 'bcm2712-rpi-cm5l-cm5io.dtb'
006576.053: Loaded overlay 'bcm2712d0'
006673.594: dtparam: audio=on
006682.439: Unknown dtparam 'audio' - ignored
006736.419: Loaded overlay 'vc4-kms-v3d-pi5'
006848.267: Loaded overlay 'dwc2'
006850.127: dtparam: dr_mode=host
007157.316: RPM 7824, max RPM 7824
008912.678: Starting OS 8912 ms
008918.200: 00000040: -> 00000480
008920.053: 00000030: -> 00100080
008924.765: 00000034: -> 00100080
008929.478: 00000038: -> 00100080
008934.191: 0000003c: -> 00100080
009043.765: sdram: sdram refresh 2081->4162 (2)
069008.455: initial_turbo of 60 deactivated

This is my version of raspberry pi os lite 64 bit:

tc@raspberrypi:~ $ sudo uname -a
Linux raspberrypi 6.6.74+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.74-1+rpt1 (2025-01-27) aarch64 GNU/Linux

And here is my bootloader version:

tc@raspberrypi:~ $ vcgencmd bootloader_version
2025/03/19 13:41:26
version cec1d3ae40f4a1cb24fe3c42d60153968695385b (release)
timestamp 1742391686
update-time 1742896386
capabilities 0x0000007f

Fresh config.txt:

tc@raspberrypi:~ $ sudo nano /boot/firmware/config.txt
  GNU nano 7.2                              /boot/firmware/config.txt

camera_auto_detect=1

# Automatically load overlays for detected DSI displays
display_auto_detect=1

# Automatically load initramfs files, if found
auto_initramfs=1

# Enable DRM VC4 V3D driver
dtoverlay=vc4-kms-v3d
max_framebuffers=2

# Don't have the firmware create an initial video= setting in cmdline.txt.
# Use the kernel's default instead.
disable_fw_kms_setup=1

# Run in 64-bit mode
arm_64bit=1

# Disable compensation for displays with overscan
disable_overscan=1

# Run as fast as firmware / board allows
arm_boost=1

[cm4]
# Enable host mode on the 2711 built-in XHCI USB controller.
# This line should be removed if the legacy DWC2 controller is required
# (e.g. for USB device mode) or if USB support is not required.
otg_mode=1

[cm5]
dtoverlay=dwc2,dr_mode=host

[all]

still hanging after watchdog 0:

Image

1 out of 10 times it reboots instantly, but 9 out of 10 times it hangs for 20-50 seconds. No SD card is inserted. I have no idea what to do from here.

hasan-akbulak avatar Mar 26 '25 15:03 hasan-akbulak

Sorry, I don't know how to fix the layout issue to make it more readable; I am quite new to GitHub.

hasan-akbulak avatar Mar 26 '25 15:03 hasan-akbulak

Also just found this in dmesg:

[    7.107845] Bluetooth: hci0: command 0xfc18 tx timeout
[    7.107856] Bluetooth: hci0: BCM: failed to write update baudrate (-110)
[    7.107858] Bluetooth: hci0: Failed to set baudrate
[    9.123848] Bluetooth: hci0: command 0xfc18 tx timeout
[    9.123859] Bluetooth: hci0: BCM: Reset failed (-110)

It isn't disabling Bluetooth by default on my CM5 without wifi or Bluetooth.

hasan-akbulak avatar Mar 26 '25 16:03 hasan-akbulak

What do these commands report?

$ od -An -tx4 --endian=big  /proc/device-tree/chosen/rpi-boardrev-ext
$ grep -a . /proc/device-tree/soc@107c000000/serial@7d50c000/status
$ grep -a . /proc/device-tree/axi/mmc@1100000/status
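
If one of those status files does not exist, grep only prints an error, which is easy to misread: a device tree node with no status property is treated as enabled. A small wrapper can make that case explicit. This is a minimal sketch; the helper name `dt_status` is hypothetical, and the node paths in the comment are the ones from the commands above.

```shell
#!/bin/sh
# Print the status of a device tree node directory. A node without a
# "status" property counts as enabled, so that case is reported as such.
dt_status() {
    node=$1
    if [ ! -d "$node" ]; then
        echo "node not found"
        return 1
    fi
    if [ -f "$node/status" ]; then
        # Property values are NUL-terminated; strip the NUL for display.
        tr -d '\0' < "$node/status"
        echo
    else
        echo "okay (no status property)"
    fi
}

# Example:
#   dt_status /proc/device-tree/soc@107c000000/serial@7d50c000
#   dt_status /proc/device-tree/axi/mmc@1100000
```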

pelwell avatar Mar 26 '25 16:03 pelwell

[ I've added to the list of things to try ]

pelwell avatar Mar 26 '25 16:03 pelwell