[4.19.126] Sporadic freeze related to mmc

Open sahib opened this issue 3 years ago • 0 comments

Describe the bug

Hello,

I'm currently debugging a nasty freeze on a CM3+ with a 4.19 kernel using a Yocto build based on meta-raspberrypi. The symptoms are as follows:

  • Everything on the device locks up, from the UI to all other services.
  • No new SSH connections are accepted (but previously existing ones persist most of the time).
  • Attached USB keyboards are not recognized (no Num Lock response), but SysRq keys over serial still work.
  • This usually happens within the first few minutes of uptime, which is when most of the I/O happens.
  • More I/O over a long period seems to trigger the bug more often (I don't have clear data on this yet), and disabling a number of our applications makes the freeze less likely.
  • MemFree and MemAvailable seem to rise at the moment of the freeze, so this is not a typical OOM situation.
  • When it happens, it often happens in batches and then stays away for a few days.
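To put numbers on the MemFree/MemAvailable observation, something like the following could be left running in an SSH or serial session that was opened before the freeze. This is only a minimal sketch: the function names and the one-second interval are assumptions, and it presumes a Python interpreter is available on the image.

```python
import time

def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value.

    Most fields are reported in kB; a few (e.g. HugePages_*) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def sample(path="/proc/meminfo"):
    """Read the three fields of interest; a missing field maps to None."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {key: info.get(key) for key in ("MemFree", "MemAvailable", "SwapFree")}

def log_samples(count=60, interval=1.0):
    """Print `count` timestamped samples, `interval` seconds apart."""
    for _ in range(count):
        print(time.strftime("%H:%M:%S"), sample(), flush=True)
        time.sleep(interval)
```

Calling `log_samples()` prints to the terminal rather than to a file, so the output does not depend on the MMC staying writable.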

After quite some time, I found a workaround of sorts. Applying this device-tree overlay seems to make the freeze far less likely, or to eliminate it entirely (I'm not sure yet which of the two applies):

/dts-v1/;
/plugin/;

/ {
	compatible = "brcm,bcm2708";
	fragment@0 {
		target = <&sdhost>;
		__overlay__ {
			non-removable;
			brcm,force-pio;
		};
	};
};

The non-removable part does not seem to be necessary. I added it because I often saw the kernel getting stuck in mmc_rescan, and since the CM3+ has an eMMC, the rescan only needs to run once. Freezes also happen with only non-removable set, so the critical part seems to be brcm,force-pio, which forces the bcm2835-sdhost driver to use the slower PIO instead of DMA.
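When comparing devices with and without the overlay, it may help to confirm that the property actually landed in the live device tree. The sketch below walks /proc/device-tree, where nodes appear as directories and properties as files, and reports every node that carries a given property. The function name is hypothetical, and no particular node path is assumed.

```python
import os

def nodes_with_property(prop, root="/proc/device-tree"):
    """Return device-tree node paths (relative to root) that carry `prop`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Each property of a node shows up as a file in the node's directory;
        # boolean properties such as brcm,force-pio are empty files.
        if prop in filenames:
            matches.append(os.path.relpath(dirpath, root))
    return sorted(matches)
```

On a device where the overlay was applied, `nodes_with_property("brcm,force-pio")` should list the sdhost node, and an empty list would suggest the overlay was not loaded.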

Steps to reproduce the behaviour

  • Let our proprietary (sorry) system run for some time on a regular I/O board with a good power supply.
  • Usually you get a freeze within 24 hours.
  • I have not found a reliable way to trigger the problem on demand. Running the system under stress-ng alone does not seem to provoke it.

Device(s)

Raspberry Pi CM3+

System

$ uname -a
Linux hostname 4.19.126-v7 #1 SMP Fri Sep 30 08:06:14 UTC 2022 armv7l GNU/Linux
$ vcgencmd version
Aug 26 2022 14:04:36
Copyright (c) 2012 Broadcom
version 102f1e848393c2112206fadffaaf86db04e98326 (clean) (release) (start)
$ cat /proc/cmdline
8250.nr_uarts=1 bcm2708_fb.fbwidth=480 bcm2708_fb.fbheight=800 bcm2708_fb.fbswap=1 dwc_otg.lpm_enable=0 usbhid.mousepoll=0 vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000 cma=512M@128M coherent_pool=6M fbcon=vc:2-4 logo.nologo quiet video=HDMI-A-1:480x800MR-24@60 ostree=/ostree/boot.1/poky/2fb5f4d11ab0a08ad01ad2b44fee499ee3d129c276dd0f41184fdc79850e413e/0  ostree_root=/dev/mmcblk0p2 root=/dev/ram0 rw rootwait rootdelay=2 ramdisk_size=8192 panic=1
$ cat config.txt
disable_overscan=1
gpu_mem=128
boot_delay=0
boot_delay_ms=0
disable_splash=1
dispmanx_offline=1
dtparam=i2c1=on
dtparam=i2c_arm=on
enable_uart=1
dtoverlay=vc4-kms-v3d
hdmi_edid_file=1
avoid_warnings=2
mask_gpu_interrupt0=0x400
dtparam=audio=on

Logs

This log is from one of the devices that exhibited the issue. The exact set of frozen tasks differs a bit from time to time, but mmc_rescan is always among them.

dmesg-prior-to-fix.log

With a device-tree overlay that only sets "non-removable", the log becomes a bit more interesting: many tasks are listed, and the memory info at the top seems to indicate that there are enough free pages. A bit of swap was in use, and the applications/tasks using it seem to be completely stuck too. Another group of tasks is stuck in rpi_firmware_transaction, which is interesting since they should not be blocked by MMC.

freeze-with-non-removable-but-without-force-pio.log

Note: The latter log was obtained from a serial connection with SysRq keys.
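As a complement to the SysRq dump (SysRq `w` prints tasks in uninterruptible sleep), the same information can be approximated from userspace by scanning /proc for threads in state `D` and reading their wait channel. This is a rough sketch with hypothetical function names. It assumes the interpreter was already running in a surviving session, since starting a new process during the freeze may itself block on the MMC, and `wchan` may read `0` or be unreadable depending on kernel configuration and permissions.

```python
import os

def parse_stat(text):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so the last ')' is used as the delimiter.
    """
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state

def blocked_tasks(proc="/proc"):
    """Return (tid, comm, wchan) for every thread in uninterruptible sleep."""
    blocked = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        task_dir = os.path.join(proc, entry, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    pid, comm, state = parse_stat(f.read())
                if state != "D":
                    continue
                with open(os.path.join(task_dir, tid, "wchan")) as f:
                    wchan = f.read().strip()
            except OSError:
                continue  # thread exited or file not readable
            blocked.append((pid, comm, wchan))
    return blocked
```

Printing `blocked_tasks()` every few seconds would show whether the set of stuck tasks grows over time, which the one-shot SysRq dump cannot.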

Additional context

This issue was first reported at the meta-raspberrypi repository, where it was suggested to report it here. The real question is: why does setting force-pio help? Is it an actual workaround for a bug, or does it just make the freeze less likely because I/O is slower?

sahib, Sep 30 '22 12:09