[4.19.126] Sporadic freeze related to mmc
Describe the bug
Hello,
I'm currently debugging a nasty freeze on a CM3+ with a 4.19 kernel using a Yocto build based on meta-raspberrypi. The symptoms are as follows:
- Everything from the UI to all other services on the device locks up.
- No new SSH connections are accepted (but previously existing ones, most of the time, persist)
- Attached USB keyboards do not get recognized (no num lock, but SysRq keys over serial still work!)
- This usually happens within the first few minutes of the device running (most I/O happens there).
- More I/O over a long time range seems to trigger the bug more often (I don't have clear data on this yet), and disabling a number of our applications makes the freeze less likely.
- MemFree & MemAvailable seem to rise at the moment of the freeze, so this is not a typical OOM situation.
- If it happens, it often happens in batches, then often stays away for a few days.
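The MemFree/MemAvailable observation in the list above can be recorded over time with a small helper. This is only a sketch: the function name `meminfo_snapshot` is made up for illustration, and it assumes a POSIX shell with awk on the target. Printing to the serial console rather than to a file on the eMMC is a deliberate choice, since writes to the eMMC may be exactly what hangs.

```shell
# Hypothetical helper (not part of the original report): print MemFree and
# MemAvailable in kB from a meminfo-style file, defaulting to /proc/meminfo.
meminfo_snapshot() {
    awk '
        $1 == "MemFree:"      { free = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { printf "MemFree=%s MemAvailable=%s\n", free, avail }
    ' "${1:-/proc/meminfo}"
}

# Example use on the serial console (runs until interrupted):
#   while sleep 5; do echo "$(date +%T) $(meminfo_snapshot)"; done
```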
After quite some time, I found a workaround of sorts that alleviates our situation. Applying this device-tree overlay seems to make the freeze far less likely, or to eliminate it entirely (I'm not sure yet which of the two applies):
/dts-v1/;
/plugin/;
/ {
    compatible = "brcm,bcm2708";
    fragment@0 {
        target = <&sdhost>;
        __overlay__ {
            non-removable;
            brcm,force-pio;
        };
    };
};
The non-removable part does not seem to be necessary. I added it because I often saw the kernel getting stuck in mmc_rescan, and since the CM3+ has an eMMC, the rescan only needs to run once. Freezes also happen with only non-removable set, so the critical part seems to be brcm,force-pio, which forces the bcm2835-sdhost driver to use the slower PIO instead of DMA.
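To confirm that the overlay was actually applied on a running device, the live device tree can be searched for the property. The sketch below assumes the kernel exposes the tree under /sys/firmware/devicetree/base (the directory that /proc/device-tree points to); the helper name `dt_has_property` is hypothetical.

```shell
# Hypothetical helper: succeed if any node under the given device-tree root
# carries a property with the given name (properties appear as files).
dt_has_property() {
    prop=$1
    root=${2:-/sys/firmware/devicetree/base}
    find "$root" -name "$prop" 2>/dev/null | grep -q .
}

# Example use:
#   dt_has_property 'brcm,force-pio' && echo "force-pio is set"
```

Searching the whole tree avoids hard-coding the sdhost node path, which can differ between kernel versions.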
Steps to reproduce the behaviour
- Let our proprietary (sorry) system run for some time on a regular I/O board with a good power supply.
- Usually you get a freeze within 24 hours.
- No reliable way to actually trigger the problem has been found. Letting the system run under stress-ng seems fine.
Device(s)
Raspberry Pi CM3+
System
$ uname -a
Linux hostname 4.19.126-v7 #1 SMP Fri Sep 30 08:06:14 UTC 2022 armv7l GNU/Linux
$ vcgencmd version
Aug 26 2022 14:04:36
Copyright (c) 2012 Broadcom
version 102f1e848393c2112206fadffaaf86db04e98326 (clean) (release) (start)
$ cat /proc/cmdline
8250.nr_uarts=1 bcm2708_fb.fbwidth=480 bcm2708_fb.fbheight=800 bcm2708_fb.fbswap=1 dwc_otg.lpm_enable=0
usbhid.mousepoll=0 vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000 cma=512M@128M coherent_pool=6M
fbcon=vc:2-4 logo.nologo quiet video=HDMI-A-1:480x800MR-24@60 ostree=/ostree/boot.1
/poky/2fb5f4d11ab0a08ad01ad2b44fee499ee3d129c276dd0f41184fdc79850e413e/0 ostree_root=/dev/mmcblk0p2
root=/dev/ram0 rw rootwait rootdelay=2 ramdisk_size=8192 panic=1
$ cat config.txt
disable_overscan=1
gpu_mem=128
boot_delay=0
boot_delay_ms=0
disable_splash=1
dispmanx_offline=1
dtparam=i2c1=on
dtparam=i2c_arm=on
enable_uart=1
dtoverlay=vc4-kms-v3d
hdmi_edid_file=1
avoid_warnings=2
mask_gpu_interrupt0=0x400
dtparam=audio=on
Logs
This log is from one of the devices that exhibited the issue. The tasks that are actually frozen differ a bit from time to time, but the common one is always mmc_rescan.
With a device-tree overlay that only has "non-removable" set, the log becomes a bit more interesting: there are a lot of tasks, and the memory info at the top seems to indicate that there are enough free pages. A bit of swap was used, and the applications/tasks using it seem to be stuck completely too. Another bunch of tasks are stuck in rpi_firmware_transaction, which is interesting since they should not be blocked by MMC.
freeze-with-non-removable-but-without-force-pio.log
Note: The latter log was obtained from a serial connection with SysRq keys.
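If an already open SSH session survives the freeze, the same kind of dump can also be requested from a root shell through /proc/sysrq-trigger instead of the serial line. A minimal sketch, assuming a kernel built with SysRq support (which the working serial SysRq keys suggest); the function name `sysrq_dump` is hypothetical, and the keys used are `m` (memory info) and `w` (tasks in blocked state). The output goes to the kernel log.

```shell
# Hypothetical helper: ask the kernel for a memory summary (m) and a list of
# blocked tasks (w). The trigger path is a parameter only so the function can
# be exercised against a plain file; on a device it must run as root.
sysrq_dump() {
    trigger=${1:-/proc/sysrq-trigger}
    for key in m w; do
        printf '%s\n' "$key" > "$trigger" || return 1
    done
}

# Example use:
#   sysrq_dump && dmesg | tail -n 200
```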
Additional context
This issue was first reported at the meta-raspberrypi repository, where it was suggested to report it here. The real question is: why does setting force-pio help? Is it an actual workaround for a bug, or is it just making the freeze less likely due to slower I/O?