Freeze on Reboot (CM4, scarthgap, RAUC)
I'm using the meta-raspberrypi layer (scarthgap) on CM4.
I have two root filesystems (A/B), which I update via RAUC in conjunction with rauc-hawkbit-updater. After RAUC finishes an update cycle, it reboots the system via the reboot shell command (post-install hook). The update itself works fine, but the CM4 often freezes during the subsequent shutdown, frequently with a kernel backtrace. The system does not recover on its own; it needs a power cycle, after which it boots fine again.
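For reference, a minimal sketch of how such a post-install reboot can be wired up in RAUC's system.conf (the paths are placeholders, not my exact setup; the "0 tryboot" reboot argument is what shows up in the shutdown log below):

# /etc/rauc/system.conf (excerpt) -- illustrative only
[handlers]
post-install=/usr/lib/rauc/post-install.sh

#!/bin/sh
# /usr/lib/rauc/post-install.sh -- runs after a successful installation and
# reboots into the freshly written slot via the Pi firmware's tryboot mechanism
reboot "0 tryboot"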
The exact kernel backtrace varies between kernel versions. Here is one example; the system hangs indefinitely after this output on the serial console (this is the last serial output, a power cycle is required):
[ 177.933557] systemd-shutdown[1]: All filesystems unmounted.
[ 177.939284] systemd-shutdown[1]: Deactivating swaps.
[ 177.944496] systemd-shutdown[1]: All swaps deactivated.
[ 177.949923] systemd-shutdown[1]: Detaching loop devices.
[ 177.960674] systemd-shutdown[1]: All loop devices detached.
[ 177.966394] systemd-shutdown[1]: Stopping MD devices.
[ 177.971775] systemd-shutdown[1]: All MD devices stopped.
[ 177.977201] systemd-shutdown[1]: Detaching DM devices.
[ 177.982592] systemd-shutdown[1]: All DM devices detached.
[ 177.988096] systemd-shutdown[1]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
[ 178.001703] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 178.008679] systemd-shutdown[1]: Rebooting with argument '0 tryboot'.
[ 218.845608] mmc0: card 0001 removed
[ 218.954746] EXT4-fs (mmcblk0p2): shut down requested (2)
[ 218.960210] Aborting journal on device mmcblk0p2-8.
[ 363.485953] INFO: task systemd-shutdow:1 blocked for more than 120 seconds.
[ 363.493077] Tainted: G C 6.6.30-v8 #1
[ 363.498752] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.506717] task:systemd-shutdow state:D stack:0 pid:1 ppid:0 flags:0x00000004
[ 363.515221] Call trace:
[ 363.517704] __switch_to+0xe0/0x160
[ 363.521264] __schedule+0x37c/0xd68
[ 363.524816] schedule+0x64/0x108
[ 363.528102] schedule_timeout+0x1a8/0x1c0
[ 363.532186] wait_for_completion+0xc8/0x170
[ 363.536444] __flush_work.isra.0+0x190/0x2e8
[ 363.540794] __cancel_work_timer+0x124/0x1b0
[ 363.545139] cancel_delayed_work_sync+0x1c/0x30
[ 363.549753] __mmc_stop_host+0x44/0x60
[ 363.553570] mmc_host_classdev_shutdown+0x1c/0x38
[ 363.558360] device_shutdown+0x138/0x268
[ 363.562357] kernel_restart+0x48/0xd0
[ 363.566087] __do_sys_reboot+0x1b4/0x258
[ 363.570083] __arm64_sys_reboot+0x2c/0x40
[ 363.574166] invoke_syscall+0x50/0x128
[ 363.577986] el0_svc_common.constprop.0+0x48/0xf0
[ 363.582775] do_el0_svc+0x24/0x38
[ 363.586152] el0_svc+0x40/0xe8
[ 363.589258] el0t_64_sync_handler+0x120/0x130
[ 363.593694] el0t_64_sync+0x190/0x198
[ 363.597437] INFO: task kworker/2:4:68 blocked for more than 120 seconds.
[ 363.604255] Tainted: G C 6.6.30-v8 #1
[ 363.609923] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.617891] task:kworker/2:4 state:D stack:0 pid:68 ppid:2 flags:0x00000008
[ 363.626395] Workqueue: events_freezable mmc_rescan
[ 363.631275] Call trace:
[ 363.633756] __switch_to+0xe0/0x160
[ 363.637307] __schedule+0x37c/0xd68
[ 363.640856] schedule+0x64/0x108
[ 363.644140] io_schedule+0x44/0x68
[ 363.647602] bit_wait_io+0x20/0x80
[ 363.651065] __wait_on_bit+0x58/0x180
[ 363.654789] out_of_line_wait_on_bit+0x90/0xc8
[ 363.659312] __wait_on_buffer+0x34/0x48
[ 363.663218] jbd2_write_superblock+0x13c/0x2f0
[ 363.667743] jbd2_journal_update_sb_errno+0x5c/0x98
[ 363.672708] jbd2_journal_abort+0x88/0x100
[ 363.676878] ext4_force_shutdown+0x1e0/0x1f8
[ 363.681223] ext4_shutdown+0x1c/0x30
[ 363.685788] fs_bdev_mark_dead+0x68/0xb0
[ 363.690701] bdev_mark_dead+0x4c/0xe0
[ 363.695352] blk_report_disk_dead+0x88/0xf8
[ 363.700537] del_gendisk+0x320/0x348
[ 363.705089] mmc_blk_remove+0xa4/0x248
[ 363.709800] mmc_bus_remove+0x28/0x40
[ 363.714421] device_remove+0x54/0x90
[ 363.718957] device_release_driver_internal+0x1d4/0x238
[ 363.725162] device_release_driver+0x20/0x38
[ 363.730389] bus_remove_device+0xd4/0x120
[ 363.735330] device_del+0x164/0x430
[ 363.739732] mmc_remove_card+0x60/0xe0
[ 363.744398] mmc_detect+0x5c/0xa0
[ 363.748615] mmc_rescan+0x94/0x328
[ 363.752914] process_one_work+0x148/0x3b8
[ 363.757824] worker_thread+0x32c/0x450
[ 363.762467] kthread+0x118/0x128
[ 363.766573] ret_from_fork+0x10/0x20
Sometimes it also hangs without any backtrace. In this case the kernel even kept printing messages for a while; again, the system hangs indefinitely after this output on the serial console and needs a power cycle:
[ 116.566952] systemd-shutdown[1]: All filesystems unmounted.
[ 116.572818] systemd-shutdown[1]: Deactivating swaps.
[ 116.578075] systemd-shutdown[1]: All swaps deactivated.
[ 116.583413] systemd-shutdown[1]: Detaching loop devices.
[ 116.597291] systemd-shutdown[1]: All loop devices detached.
[ 116.603015] systemd-shutdown[1]: Stopping MD devices.
[ 116.608583] systemd-shutdown[1]: All MD devices stopped.
[ 116.614019] systemd-shutdown[1]: Detaching DM devices.
[ 116.619598] systemd-shutdown[1]: All DM devices detached.
[ 116.625137] systemd-shutdown[1]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
[ 116.638948] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 116.646282] systemd-shutdown[1]: Rebooting with argument '0 tryboot'.
[21626.827418] IPv4: martian source 255.255.255.255 from 192.168.0.248, on dev eth0
[21626.834971] ll header: 00000000: ff ff ff ff ff ff d8 eb 46 86 fd 39 08 00
[21628.434569] IPv4: martian source 255.255.255.255 from 192.168.0.56, on dev eth0
[21628.442006] ll header: 00000000: ff ff ff ff ff ff b4 8a 0a b1 6c 90 08 00
[...]
[23703.944066] IPv4: martian source 255.255.255.255 from 192.168.0.56, on dev eth0
[23703.951501] ll header: 00000000: ff ff ff ff ff ff b4 8a 0a b1 6c 90 08 00
[23708.944633] IPv4: martian source 255.255.255.255 from 192.168.0.56, on dev eth0
[23708.952069] ll header: 00000000: ff ff ff ff ff ff b4 8a 0a b1 6c 90 08 00
[23712.222015] bcmgenet fd580000.ethernet eth0: Link is Down
[23715.294285] bcmgenet fd580000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
Has anyone ever encountered such an issue while using the CM4 with meta-raspberrypi?
--
I can't rule out memory corruption issues in user space. So far I have not been able to trigger this fault without performing a RAUC update cycle before rebooting, and I have not been able to provoke anything similar on Raspbian. I'm also not yet sure whether the kernel stops working entirely, or whether it is still running, which would point to a systemd problem.
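To tell those two cases apart the next time it hangs, a sketch of what I plan to try (assuming CONFIG_MAGIC_SYSRQ is enabled and the serial port is a kernel console):

# before triggering the reboot, enable all magic SysRq functions
echo 1 > /proc/sys/kernel/sysrq

# while the system hangs, send a serial BREAK followed by 'l' from the host
# terminal (in minicom: Ctrl-A F, then l). If the kernel is still alive it
# prints a backtrace for all active CPUs; no reaction would point to the
# kernel itself being stuck.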
I encounter this issue with an RPi4 + meta-raspberrypi + scarthgap. I use SWUpdate rather than RAUC, but IMHO that is irrelevant.
Workaround: set the following kernel parameter:
CMDLINE:append:pn-rpi-cmdline = " bcm2835-wdt.nowayout=1"
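For reference, a sketch of where the override can live and why it helps (the comment reflects my understanding of the nowayout semantics):

# conf/local.conf (or any global .conf) -- pn-rpi-cmdline scopes the append to
# the rpi-cmdline recipe from meta-raspberrypi that generates cmdline.txt
CMDLINE:append:pn-rpi-cmdline = " bcm2835-wdt.nowayout=1"

# nowayout means the watchdog can no longer be stopped once it has been started
# (e.g. by systemd's RuntimeWatchdogSec=), so a hang after userspace stops
# petting it ends in a hardware reset instead of a permanent freeze.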
@mlilien Nice to hear from someone encountering the same (or similar) problem! Also thanks for the workaround, nice idea to use that WDT parameter.
Have you found a good way to reproduce this? My current approach is to run RAUC updates on the CM4 in a loop; that way the error shows up within 1-5 days.
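Essentially one iteration per boot of something like this (a sketch; the bundle path is made up, and in my real setup the updates are pushed via hawkBit rather than installed from a local file):

#!/bin/sh
# repro-iteration.sh -- started once per boot (e.g. from a systemd unit);
# the post-install hook then reboots into the other slot
rauc install /data/update-bundle.raucb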
Do you have a DSI Display connected to the RPI4?
Have you found a good way to reproduce this?
Unfortunately no.
Do you have a DSI Display connected to the RPI4?
No, the system is headless.
You can also instruct systemd to program the HW watchdog with a shorter timeout for the reboot phase; see systemd-system.conf(5) -- the default timeout is 10 minutes.
Of course, that is also only a workaround.
What puzzles me is the order of outputs in the first log. At the point where "Rebooting with argument" is printed, all filesystems should be either fully unmounted or at least remounted read-only. That doesn't seem to be the case here, since mmcblk0p2 is still active read-write and the card removal triggers a journal write.
I would try to find out what mmcblk0p2 is (rootfs? which one?) and who mounted it in the first place. If it was mounted through the normal systemd/fstab mechanism, the filesystem should be unmounted by the time "All filesystems unmounted" is printed.
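Something along these lines on the running system should answer both questions (a sketch; the mount-unit name will differ depending on where the partition is mounted):

# what is mmcblk0p2, and where is it mounted?
lsblk -o NAME,FSTYPE,LABEL,MOUNTPOINT /dev/mmcblk0
findmnt --source /dev/mmcblk0p2

# who mounted it: if systemd manages it, there is a corresponding mount unit
systemctl list-units --type=mount
systemctl status -- -.mount        # '-.mount' is the root filesystem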
@tobiasjakobi-lr Thank you for your response!
The missing unmount is a good catch!
I'm still (passively) working on this problem, but I have a few new insights. It looks like the indefinite hang is actually caused by the DSI panel connected to the CM4. I was able to reproduce this on Raspbian.
Unfortunately, by the time this indefinite hang happens, the watchdog has already been disabled by systemd and therefore does not catch it.
@berkutta systemd never disables the HW watchdog (if you let it use it in the first place). During normal operation it honors the RuntimeWatchdogSec= setting, and once it enters the reboot phase it reprograms the watchdog with the RebootWatchdogSec= setting. Some watchdogs don't allow reprogramming the timeout; in that case you stay on the previous one.
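For completeness, the two settings live in the [Manager] section of /etc/systemd/system.conf (example values only, not recommendations):

# /etc/systemd/system.conf (excerpt)
[Manager]
RuntimeWatchdogSec=15s    # pet the HW watchdog during normal operation
RebootWatchdogSec=2min    # timeout programmed for the reboot phase (default: 10min)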