
Workqueue: events_freezable mmc_rescan crash with Raspbian kernel 4.14.79-v7+

[Open] op-ecoprintq opened this issue 5 years ago • 71 comments

I have a few Raspberry Pi 3 B+ exhibiting the same problem. They crash after 2-3 days of uptime with the following error:

[169451.220021] INFO: task kworker/0:3:10949 blocked for more than 120 seconds.
[169451.220036]       Tainted: G         C        4.14.79-v7+ #1159
[169451.220041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[169451.220048] kworker/0:3     D    0 10949      2 0x00000000
[169451.220077] Workqueue: events_freezable mmc_rescan
[169451.220110] [<8079ef70>] (__schedule) from [<8079f5d8>] (schedule+0x50/0xa8)
[169451.220130] [<8079f5d8>] (schedule) from [<8061a2d0>] (__mmc_claim_host+0xb8/0x1cc)
[169451.220147] [<8061a2d0>] (__mmc_claim_host) from [<8061a414>] (mmc_get_card+0x30/0x34)
[169451.220163] [<8061a414>] (mmc_get_card) from [<80623010>] (mmc_sd_detect+0x20/0x74)
[169451.220179] [<80623010>] (mmc_sd_detect) from [<8061ccdc>] (mmc_rescan+0x1c8/0x394)
[169451.220197] [<8061ccdc>] (mmc_rescan) from [<801379b4>] (process_one_work+0x158/0x454)
[169451.220212] [<801379b4>] (process_one_work) from [<80137d14>] (worker_thread+0x64/0x5b8)
[169451.220227] [<80137d14>] (worker_thread) from [<8013dd98>] (kthread+0x13c/0x16c)
[169451.220246] [<8013dd98>] (kthread) from [<801080ac>] (ret_from_fork+0x14/0x28)

The machines are running Raspbian Stretch.

op-ecoprintq avatar Jan 14 '19 22:01 op-ecoprintq

Same problem on a 3B after upgrading to 4.14.79-v7+. It crashes every 1-3 days. Now downgrading to 4.14.52-v7+.

ldrolez avatar Jan 18 '19 09:01 ldrolez

Hmm, what are the Pis doing during that time? We haven't seen anything like this in the office, so I'd be interested to know what they are doing that might cause this.

JamesH65 avatar Jan 18 '19 10:01 JamesH65

Actually, this is a duplicate of an issue on our Linux tracker, which is the correct place for it to be. Please continue the conversation there: https://github.com/raspberrypi/linux/issues/2810

JamesH65 avatar Jan 18 '19 10:01 JamesH65

Bum, closed wrong one. Reopened. Will now attempt to close the one in firmware...

JamesH65 avatar Jan 18 '19 10:01 JamesH65

So the machines in question are running LXDE, a Java application, and the onboard keyboard.

op-ecoprintq avatar Jan 18 '19 16:01 op-ecoprintq

That doesn't tell me anything about their level of activity: CPU, file system, network, etc.

pelwell avatar Jan 18 '19 16:01 pelwell

Same issue here. Attached is my kernel log: kern.log. The issue occurs around Jan 18 19:57:02. This is on a headless install. The green light is constant and the Pi responds to pings, but it won't open an ssh session. For me this actually occurs while a movie is being streamed by Plex, around an hour and a half in. I left the Pi to see if it would recover overnight; sadly it didn't.

Alexurb avatar Jan 19 '19 07:01 Alexurb

Happened again mid-movie. Attached is another kernel log: kern.log

Alexurb avatar Jan 21 '19 20:01 Alexurb

I've so far found some advice from https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/

Summary:

sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -p

This seems to do the trick so far; still testing (3 movies down). I'll comment if it occurs again. To make it permanent (after reboot), read the link.
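
In short, one common way to persist those values (a minimal sketch, assuming the stock /etc/sysctl.conf is in use):

# Append the tunables so they are reapplied at every boot
echo 'vm.dirty_ratio = 10' | sudo tee -a /etc/sysctl.conf
echo 'vm.dirty_background_ratio = 5' | sudo tee -a /etc/sysctl.conf
# Reload the file immediately, without rebooting
sudo sysctl -p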

Alexurb avatar Jan 23 '19 12:01 Alexurb

So far so good.

Alexurb avatar Jan 23 '19 20:01 Alexurb

Just going to say this thread saved my Pi 3B+. It was hanging terribly on any intense I/O. The solution by Alexurb fixed the problem for me.

peabody avatar Aug 29 '19 07:08 peabody

Same problem here on my Pi 3B+ running Raspbian Buster. Gonna try that edit and then reboot. Fingers crossed!

jdrch avatar Apr 04 '20 06:04 jdrch

Just ran into this with kernel 5.4.51-v8 aarch64 on Pi 4.

I'll try the fix recommended above in this thread:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Here are the dmesg errors:

[11722.001585] INFO: task kworker/1:0:1663 blocked for more than 120 seconds.
[11722.005139]       Tainted: G         C        5.4.51-v8 #1
[11722.008672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11722.012222] kworker/1:0     D    0  1663      2 0x00000028
[11722.015773] Workqueue: events_freezable mmc_rescan
[11722.019269] Call trace:
[11722.022724]  __switch_to+0x110/0x180
[11722.026149]  __schedule+0x2f4/0x750
[11722.029523]  schedule+0x44/0xe0
[11722.032807]  __mmc_claim_host+0xb8/0x210
[11722.036111]  mmc_get_card+0x38/0x50
[11722.039407]  mmc_sd_detect+0x24/0x90
[11722.042671]  mmc_rescan+0xc8/0x390
[11722.045892]  process_one_work+0x1c0/0x470
[11722.049078]  worker_thread+0x50/0x430
[11722.052405]  kthread+0x100/0x130
[11722.055223]  ret_from_fork+0x10/0x18

paralin avatar Sep 22 '20 05:09 paralin

Will too much writing reduce the life of SD cards? Can we try this setting: vm.dirty_background_ratio = 5, vm.dirty_ratio = 80?

My card hasn't failed since I made the change on April 4 of this year. I've been using the same SanDisk microSD card I bought new since mid-2018. I think you only need to worry about microSD card life if you're using super cheap or no name cards. If you use a high tier SanDisk you should be just fine.

jdrch avatar Nov 22 '20 05:11 jdrch

Seeing the same; it seems to be a memory leak in lxpanel causing it to run out of memory and start swapping until it dies. htop output from an ssh session I left running:

  Mem[||||||||||||||||||||||||||||||||||||||||||||882M/924M]   Uptime: 07:48:17
[...]
  719 pi         20   0  882M  488M   204 D  0.7 52.9 35:19.23 lxpanel --profile LXDE-pi
[...]

Straight after reboot:

  Mem[||||||||||||||||||||||||||||||||||||||||||||390M/924M]   Uptime: 00:03:50
[...]
 726 pi         20   0  401M 30128 23492 S  5.1  3.2  0:05.68 lxpanel --profile LXDE-pi
[...]

See also https://www.raspberrypi.org/forums/viewtopic.php?t=267015

ranma avatar Feb 23 '21 18:02 ranma

I haven't noticed this problem on my 3B+ since July 21 of last year, which is just before a Foundation engineer announced that a fix had been committed.

Are you fully patched and updated on Buster and still seeing the issue?

One thing I'm thinking, based on both my repo issue thread and the Raspberry Pi Forum one you linked to (thanks!), is that this could be triggered by having the CPU % lxpanel plugin AND a process that causes CPU % to spike for a significant length of time, e.g. backintime, as I did. If you read through my issue thread you'll notice the whole thing went away once I removed backintime.

jdrch avatar Feb 23 '21 18:02 jdrch

This is on a 1GB 3B+ with a freshly installed and updated Raspbian image, so it should be Buster:

Raspberry Pi 3 Model B Plus Rev 1.3
Raspberry Pi reference 2021-01-11
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 21090519d85bdaa1615d5d5057d37b09368ea5d2, stage4
Linux raspberrypi 5.10.11-v7+ #1399 SMP Thu Jan 28 12:06:05 GMT 2021 armv7l GNU/Linux
Videocore Jan 27 2021 22:26:53

It was crashing every ~6 hours from high lxpanel memory usage until I set up a script to restart lxpanel once an hour.
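
(The restart script is nothing fancy; a hypothetical equivalent, given that lxpanel gets respawned automatically by the session, would be a single line in the pi user's crontab:)

# Kill lxpanel at the top of every hour; the LXDE session respawns it
0 * * * * /usr/bin/pkill -x lxpanel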

I also added:

vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

Besides LXDE there's influxdb and an autostarted Chromium with Grafana running, plus I'm using the official 7" DSI touch display. It actually still likes to crash, just less often; I haven't investigated the more recent ones yet.

ranma avatar Mar 06 '21 08:03 ranma

Actually it's still crashing from lxpanel it seems, so the hourly restart isn't good enough to prevent whatever lxpanel is doing. I also would have expected the OOM-killer to kick in, but somehow the system seems to lock up instead?
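
(If the box still accepts input on a console or an already-open ssh session at that point, one way to test whether the OOM-killer can still do its job is to invoke it by hand via SysRq; a hedged sketch, assuming SysRq isn't disabled:)

# Allow all SysRq functions, then ask the kernel to run the OOM killer once
echo 1 | sudo tee /proc/sys/kernel/sysrq
echo f | sudo tee /proc/sysrq-trigger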

htop excerpt:

  1  [||||||||                                        13.0%]   Tasks: 78, 157 thr; 3 running
  2  [|||||||||                                       15.3%]   Load average: 39.39 40.12 37.97 
  3  [|||||||||                                       14.3%]   Uptime: 07:53:55
  4  [||||||||||                                      16.8%]
  Mem[||||||||||||||||||||||||||||||||||||||||||||881M/924M]
  Swp[||||||||||||||||||||||||||||||||||||||||100.0M/100.0M]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                         
  457 grafana    20   0  934M 10728     0 S 32.2  1.1 29:40.85 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
  451 influxdb   20   0  978M  9476     0 D 28.4  1.0 35:29.89 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
  494 grafana    20   0  934M 10728     0 R 20.4  1.1 15:36.50 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
  731 grafana    20   0  934M 10728     0 S 11.6  1.1  9:18.06 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pid
 1347 influxdb   20   0  978M  9476     0 D  6.1  1.0  5:24.33 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
[...]
  795 pi         20   0  636M 39568     0 D  1.0  4.2  3:43.82 /usr/lib/chromium-browser/chromium-browser-v7 --force-renderer-a
 1282 pi         20   0  429M 72396   372 D  0.8  7.6  4:00.12 /usr/lib/chromium-browser/chromium-browser-v7 --type=renderer --
[...]
 760 pi         20   0 1012M  591M     0 D  0.4 64.0 28:24.12 lxpanel --profile LXDE-pi

Kernel messages excerpt:

[  119.755918] logitech-hidpp-device 0003:046D:404D.0004: HID++ 4.1 device connected.
[14620.750775] INFO: task kworker/2:0:7862 blocked for more than 122 seconds.
[14620.750798]       Tainted: G         C        5.10.11-v7+ #1399
[14620.750807] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14620.750818] task:kworker/2:0     state:D stack:    0 pid: 7862 ppid:     2 flags:0x00000000
[14620.750859] Workqueue: events_freezable mmc_rescan
[14620.750876] Backtrace:
[14620.750906] [<809e5190>] (__schedule) from [<809e5b60>] (schedule+0x68/0xe4)
[14620.750924]  r10:822b1800 r9:ffffe000 r8:00000000 r7:00000000 r6:40000113 r5:9d8b2e80
[14620.750934]  r4:ffffe000
[14620.750951] [<809e5af8>] (schedule) from [<807f72f8>] (__mmc_claim_host+0xe0/0x238)
[14620.750962]  r5:822b1a18 r4:00000002
[14620.750978] [<807f7218>] (__mmc_claim_host) from [<807f7488>] (mmc_get_card+0x38/0x3c)
[14620.750994]  r10:b6b53640 r9:00000000 r8:b6b56c00 r7:00000080 r6:822b1a18 r5:00000000
[14620.751005]  r4:822b2800
[14620.751041] [<807f7450>] (mmc_get_card) from [<808010a0>] (mmc_sd_detect+0x24/0x7c)
[14620.751053]  r5:822b1800 r4:822b1800
[14620.751071] [<8080107c>] (mmc_sd_detect) from [<807f9cc0>] (mmc_rescan+0xdc/0x3b0)
[14620.751083]  r5:822b1800 r4:822b1a7c
[14620.751103] [<807f9be4>] (mmc_rescan) from [<8013b78c>] (process_one_work+0x254/0x584)
[14620.751119]  r9:00000000 r8:b6b56c00 r7:00000080 r6:b6b53640 r5:89128200 r4:822b1a7c
[14620.751137] [<8013b538>] (process_one_work) from [<8013bb1c>] (worker_thread+0x60/0x5c4)
[14620.751153]  r10:b6b53640 r9:80e03d00 r8:b6b53658 r7:00000008 r6:b6b53640 r5:89128214
[14620.751164]  r4:89128200
[14620.751182] [<8013babc>] (worker_thread) from [<801435a4>] (kthread+0x170/0x174)
[14620.751198]  r10:adf9be74 r9:89128200 r8:8013babc r7:9b4ba000 r6:00000000 r5:ade02c80
[14620.751209]  r4:9170bf80
[14620.751225] [<80143434>] (kthread) from [<801000ec>] (ret_from_fork+0x14/0x28)
[14620.751236] Exception stack(0x9b4bbfb0 to 0x9b4bbff8)
[14620.751249] bfa0:                                     00000000 00000000 00000000 00000000
[14620.751264] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[14620.751277] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000  
[14620.751293]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:80143434
[14620.751303]  r4:ade02c80

ranma avatar Mar 06 '21 17:03 ranma

Hmmm ... I would suggest disabling cron jobs and/or processes and services one by one and seeing how that goes. As I said, once I removed backintime the issue went away (or at least is now sufficiently infrequent for me to not notice it at all). While lxpanel is crashing, I believe in some cases the crash is being prompted by something else.

jdrch avatar Mar 06 '21 17:03 jdrch

So I've enabled memory cgroups + swap on zram and put lxpanel into a limited cgroup. As expected, this morning I found it almost maxing out its allocation limit and heavily swapping (but only the lxpanel process, thanks to the cgroup).
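
(For reference, a rough sketch of the cgroup part only; these are cgroup v1 paths, the 256 MB cap and the group name are arbitrary assumptions, and the zram swap setup is omitted:)

# Memory cgroups are disabled by default on Raspberry Pi OS; append to the single
# line in /boot/cmdline.txt and reboot first:
#   cgroup_enable=memory cgroup_memory=1
sudo mkdir /sys/fs/cgroup/memory/lxpanel
echo $((256*1024*1024)) | sudo tee /sys/fs/cgroup/memory/lxpanel/memory.limit_in_bytes
# Move every running lxpanel process into the limited group
pgrep -x lxpanel | while read pid; do
  echo "$pid" | sudo tee /sys/fs/cgroup/memory/lxpanel/cgroup.procs
done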

Time for some investigation:

(gdb) attach 3924
Attaching to program: /usr/bin/lxpanel, process 3924
[New LWP 3928]
[New LWP 3942]
[New LWP 3948]
[New LWP 3988]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
0x73822510 in g_dbus_object_get_object_path@plt () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
(gdb) bt
#0  0x73822510 in g_dbus_object_get_object_path@plt () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
#1  0x73824ae0 in ?? () from /usr/lib/arm-linux-gnueabihf/lxpanel/plugins/bluetooth.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) gcore lxpanel.core
warning: target file /proc/3924/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 1048576 bytes at 0x60aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x64aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x68aff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6cb00000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6dd00000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6df00000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6e700000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6e8ff000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x6f100000.
warning: Memory read failed for corefile section, 135168 bytes at 0x6f900000.
warning: Memory read failed for corefile section, 393216 bytes at 0x6fa30000.
warning: Memory read failed for corefile section, 237568 bytes at 0x6fac5000.
warning: Memory read failed for corefile section, 135168 bytes at 0x70300000.
warning: Memory read failed for corefile section, 4096 bytes at 0x70465000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x70468000.
warning: Memory read failed for corefile section, 135168 bytes at 0x74600000.
warning: Memory read failed for corefile section, 8192 bytes at 0x75b32000.
warning: Memory read failed for corefile section, 12288 bytes at 0x7639d000.
warning: Memory read failed for corefile section, 131072 bytes at 0x76ed6000.
warning: Memory read failed for corefile section, 135168 bytes at 0x7ea79000.
Saved corefile lxpanel.core

Let's try looking at that core file:

$ ls -lh lxpanel.core 
-rw-r--r-- 1 pi pi 432M Mar  7 10:37 lxpanel.core
$ strings lxpanel.core  | sort | uniq -c | sort -n -k1 -r | head -20
   6272 :1.18
   4376 daRvPaRv
   2569 org.bluez.Device1
   2414 org.freedesktop.DBus.Properties
   2080 org.freedesktop.DBus.Introspectable
   1595 RSSI
   1094 UUIDs
   1094 Alias
   1094 AddressType
   1094 Address
   1092 Trusted
   1092 ServicesResolved
   1092 Paired
   1092 LegacyPairing
   1092 Connected
   1092 Blocked
   1092 Adapter
    980 MXo=
    972 ManufacturerData
    888 /org/bluez/hci0

So I think it's likely that it's the bluetooth lxpanel plugin that is leaking memory in my case. I'll try disabling the bluetooth plugin and see how it behaves after that.
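
(A minimal sketch of how I can watch for the leak afterwards; it just logs lxpanel's resident set size once a minute, so any remaining leak shows up as steady growth:)

# Sample lxpanel's PID, RSS (KB) and elapsed run time every minute
while sleep 60; do
  printf '%s ' "$(date '+%F %T')"
  ps -o pid=,rss=,etime= -C lxpanel
done >> ~/lxpanel-rss.log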

ranma avatar Mar 07 '21 09:03 ranma

While lxpanel is crashing, I believe in some cases the crash is being prompted by something else.

Technically it looks more like a lockup from out-of-memory, not a crash. lxpanel is the clear culprit because its resident set size is initially small, but is at 591M out of ~1GB total at the point of the lockup. Also putting lxpanel into a cgroup prevents the lockup for me.

There are actually two separate issues here:

  1. The system seems to lock up instead of the OOM-killer triggering and killing the offending lxpanel process (at which point the system would recover and continue working, since lxpanel is auto-respawned). It might be that swapping on the SD card runs into some kernel issue with the default kernel settings?

  2. The memory leak in lxpanel (or rather, most likely, the raspi-specific lxplug-bluetooth). I'll open a new bug at https://github.com/raspberrypi-ui/lxplug-bluetooth/ for this one; I think there are missing calls to g_variant_unref in cb_interface_properties. I'll also need to see if heaptrack is usable; I tried valgrind's massif yesterday and it was just too slow.

ranma avatar Mar 07 '21 11:03 ranma

@spl237 FYI

JamesH65 avatar Mar 07 '21 11:03 JamesH65

Looks like this problem has returned for me. We're now in year 3 of a serious issue that still hasn't been resolved. I'm hoping to migrate my UniFi Controller workload from my Pi to bare metal on my Debian server due to this.

jdrch avatar Apr 15 '21 18:04 jdrch

I'm having the issue every day. Tried adding the suggested

vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

But it didn't help.

$ uname -a
Linux raspberrypi 5.10.63-v7+ #1459 SMP Wed Oct 6 16:41:10 BST 2021 armv7l GNU/Linux
Nov 17 12:58:18 raspberrypi kernel: [105060.916417] INFO: task kworker/1:0:17084 blocked for more than 122 seconds.
Nov 17 12:58:19 raspberrypi kernel: [105060.916440]       Tainted: G         C        5.10.63-v7+ #1459
Nov 17 12:58:19 raspberrypi kernel: [105060.916450] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 17 12:58:19 raspberrypi kernel: [105060.916463] task:kworker/1:0     state:D stack:    0 pid:17084 ppid:     2 flags:0x00000000
Nov 17 12:58:19 raspberrypi kernel: [105060.916506] Workqueue: events_freezable mmc_rescan
Nov 17 12:58:19 raspberrypi kernel: [105060.916525] Backtrace: 
Nov 17 12:58:19 raspberrypi kernel: [105060.916560] [<809f9df0>] (__schedule) from [<809fa7c8>] (schedule+0x68/0xe4)
Nov 17 12:58:19 raspberrypi kernel: [105060.916580]  r10:822c9800 r9:ffffe000 r8:00000000 r7:00000000 r6:60000013 r5:b1b6ec80
Nov 17 12:58:19 raspberrypi kernel: [105060.916592]  r4:ffffe000
Nov 17 12:58:19 raspberrypi kernel: [105060.916615] [<809fa760>] (schedule) from [<80806328>] (__mmc_claim_host+0xe0/0x238)
Nov 17 12:58:19 raspberrypi kernel: [105060.916628]  r5:822c9a18 r4:00000002
Nov 17 12:58:19 raspberrypi kernel: [105060.916648] [<80806248>] (__mmc_claim_host) from [<808064b8>] (mmc_get_card+0x38/0x3c)
Nov 17 12:58:19 raspberrypi kernel: [105060.916667]  r10:00000000 r9:00000000 r8:00000040 r7:b6b42d00 r6:822c9a18 r5:00000000
Nov 17 12:58:19 raspberrypi kernel: [105060.916678]  r4:822ca800
Nov 17 12:58:19 raspberrypi kernel: [105060.916700] [<80806480>] (mmc_get_card) from [<80810144>] (mmc_sd_detect+0x24/0x7c)
Nov 17 12:58:19 raspberrypi kernel: [105060.916713]  r5:822c9800 r4:822c9800
Nov 17 12:58:19 raspberrypi kernel: [105060.916733] [<80810120>] (mmc_sd_detect) from [<80808d1c>] (mmc_rescan+0xdc/0x3b0)
Nov 17 12:58:19 raspberrypi kernel: [105060.916746]  r5:822c9800 r4:822c9a7c
Nov 17 12:58:19 raspberrypi kernel: [105060.916769] [<80808c40>] (mmc_rescan) from [<8013b900>] (process_one_work+0x250/0x5a0)
Nov 17 12:58:19 raspberrypi kernel: [105060.916788]  r9:00000000 r8:00000040 r7:b6b42d00 r6:b6b3f6c0 r5:825a3c00 r4:822c9a7c
Nov 17 12:58:19 raspberrypi kernel: [105060.916828] [<8013b6b0>] (process_one_work) from [<8013bcb0>] (worker_thread+0x60/0x5c4)
Nov 17 12:58:19 raspberrypi kernel: [105060.916847]  r10:b6b3f6c0 r9:80f03d00 r8:b6b3f6d8 r7:00000008 r6:b6b3f6c0 r5:825a3c14
Nov 17 12:58:19 raspberrypi kernel: [105060.916858]  r4:825a3c00
Nov 17 12:58:19 raspberrypi kernel: [105060.916879] [<8013bc50>] (worker_thread) from [<80143790>] (kthread+0x170/0x174)
Nov 17 12:58:19 raspberrypi kernel: [105060.916897]  r10:9023be74 r9:825a3c00 r8:8013bc50 r7:82bce000 r6:00000000 r5:b6325d00
Nov 17 12:58:19 raspberrypi kernel: [105060.916909]  r4:b63257c0
Nov 17 12:58:19 raspberrypi kernel: [105060.917063] [<80143620>] (kthread) from [<801000ec>] (ret_from_fork+0x14/0x28)
Nov 17 12:58:19 raspberrypi kernel: [105060.917077] Exception stack(0x82bcffb0 to 0x82bcfff8)
Nov 17 12:58:19 raspberrypi kernel: [105060.917091] ffa0:                                     00000000 00000000 00000000 00000000
Nov 17 12:58:19 raspberrypi kernel: [105060.917108] ffc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 17 12:58:19 raspberrypi kernel: [105060.917124] ffe0: 00000000 00000000 00000000 00000000 00000013 00000000
Nov 17 12:58:19 raspberrypi kernel: [105060.917142]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:80143620
Nov 17 12:58:19 raspberrypi kernel: [105060.917153]  r4:b6325d00

LoSunny avatar Nov 17 '21 05:11 LoSunny

Have you tried running a 64-bit kernel? Edit config.txt to include arm_64bit=1 and reboot.
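
(For anyone unsure where that goes, a minimal sketch, assuming the stock Raspberry Pi OS layout with the boot partition mounted at /boot:)

# Enable the 64-bit kernel and reboot
echo 'arm_64bit=1' | sudo tee -a /boot/config.txt
sudo reboot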

pelwell avatar Nov 17 '21 09:11 pelwell

I have the same issue roughly every 24 hours with my "Raspberry Pi 4 Model B Rev 1.1" with 4GB RAM.

uname command: "Linux dev-netzeal 5.10.78-v7l+ #1484 SMP Thu Nov 11 16:36:06 GMT 2021 armv7l GNU/Linux"

ThomasVolkert avatar Nov 21 '21 10:11 ThomasVolkert

I can also confirm this issue happens on Arch Linux ARM (32-bit), which by default uses Raspbian's patched kernel. I must mention that, at least for me, this doesn't happen every 24 hours; it can happen anywhere from 2 hours to 2-3 days after boot. I also have oomd enabled, but this doesn't seem to be a memory leak issue, since oomd never seems to kick in.

Some extra details:
Kernel: 5.10.81-3-rpi-ARCH
DE/WM: None, the RPi is pretty much a lightweight, headless server.
Host: Raspberry Pi 4 Model B Rev 1.2
Custom config.txt options: Only arm_boost is enabled.

Logs (newest first):
<had to hard-reboot at this point>
Nov 27 23:45:52 alarmpi kernel: [<c0247240>] (process_one_work) from [<c02475f0>] (worker_thread+0x64/0x5a8)
Nov 27 23:45:42 alarmpi kernel: [<c0c10308>] (mmc_rescan) from [<c0247240>] (process_one_work+0x200/0x54c)
Nov 27 23:45:28 alarmpi kernel: [<c0c17084>] (mmc_sd_detect) from [<c0c10308>] (mmc_rescan+0xcc/0x450)
Nov 27 23:45:14 alarmpi kernel: [<c0c0d2c8>] (__mmc_claim_host) from [<c0c17084>] (mmc_sd_detect+0x14/0x6c)
Nov 27 23:45:01 alarmpi kernel: [<c0e113f8>] (schedule) from [<c0c0d2c8>] (__mmc_claim_host+0xd8/0x228)
Nov 27 23:44:46 alarmpi kernel: [<c0e10c98>] (__schedule) from [<c0e113f8>] (schedule+0x60/0xdc)
Nov 27 23:44:29 alarmpi kernel: Workqueue: events_freezable mmc_rescan
Nov 27 23:44:16 alarmpi kernel: task:kworker/1:2     state:D stack:    0 pid:16533 ppid:     2 flags:0x00000020
Nov 27 23:44:02 alarmpi kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 27 23:43:45 alarmpi kernel:       Tainted: G         C        5.10.81-3-rpi-ARCH #1
Nov 27 23:43:24 alarmpi kernel: INFO: task kworker/1:2:16533 blocked for more than 614 seconds.
Nov 27 23:43:03 alarmpi kernel: bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
Nov 27 23:42:51 alarmpi kernel: bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 27 23:42:40 alarmpi kernel: bfa0:                                     00000000 00000000 00000000 00000000
Nov 27 23:42:26 alarmpi kernel: Exception stack(0xc47ebfb0 to 0xc47ebff8)
Nov 27 23:42:02 alarmpi kernel: [<c024d484>] (kthread) from [<c0200118>] (ret_from_fork+0x14/0x3c)
Nov 27 23:41:45 alarmpi kernel: [<c02475f0>] (worker_thread) from [<c024d484>] (kthread+0x14c/0x150)
Nov 27 23:41:25 alarmpi kernel: [<c0247240>] (process_one_work) from [<c02475f0>] (worker_thread+0x64/0x5a8)
Nov 27 23:41:14 alarmpi kernel: [<c0c10308>] (mmc_rescan) from [<c0247240>] (process_one_work+0x200/0x54c)
Nov 27 23:41:00 alarmpi kernel: [<c0c17084>] (mmc_sd_detect) from [<c0c10308>] (mmc_rescan+0xcc/0x450)
Nov 27 23:40:42 alarmpi kernel: [<c0c0d2c8>] (__mmc_claim_host) from [<c0c17084>] (mmc_sd_detect+0x14/0x6c)
Nov 27 23:40:28 alarmpi kernel: [<c0e113f8>] (schedule) from [<c0c0d2c8>] (__mmc_claim_host+0xd8/0x228)
Nov 27 23:40:14 alarmpi kernel: [<c0e10c98>] (__schedule) from [<c0e113f8>] (schedule+0x60/0xdc)
Nov 27 23:40:04 alarmpi kernel: Workqueue: events_freezable mmc_rescan
Nov 27 23:39:54 alarmpi kernel: task:kworker/1:2     state:D stack:    0 pid:16533 ppid:     2 flags:0x00000020
Nov 27 23:39:43 alarmpi kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 27 23:39:27 alarmpi kernel:       Tainted: G         C        5.10.81-3-rpi-ARCH #1
Nov 27 23:39:13 alarmpi kernel: INFO: task kworker/1:2:16533 blocked for more than 491 seconds.
Nov 27 23:39:02 alarmpi kernel: bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
Nov 27 23:38:50 alarmpi kernel: bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 27 23:38:36 alarmpi kernel: bfa0:                                     00000000 00000000 00000000 00000000
Nov 27 23:38:23 alarmpi kernel: Exception stack(0xc47ebfb0 to 0xc47ebff8)
Nov 27 23:38:12 alarmpi kernel: [<c024d484>] (kthread) from [<c0200118>] (ret_from_fork+0x14/0x3c)
Nov 27 23:37:59 alarmpi kernel: [<c02475f0>] (worker_thread) from [<c024d484>] (kthread+0x14c/0x150)
Nov 27 23:37:47 alarmpi kernel: [<c0247240>] (process_one_work) from [<c02475f0>] (worker_thread+0x64/0x5a8)
Nov 27 23:37:38 alarmpi kernel: [<c0c10308>] (mmc_rescan) from [<c0247240>] (process_one_work+0x200/0x54c)
Nov 27 23:37:24 alarmpi kernel: [<c0c17084>] (mmc_sd_detect) from [<c0c10308>] (mmc_rescan+0xcc/0x450)
Nov 27 23:37:14 alarmpi kernel: [<c0c0d2c8>] (__mmc_claim_host) from [<c0c17084>] (mmc_sd_detect+0x14/0x6c)
Nov 27 23:37:02 alarmpi kernel: [<c0e113f8>] (schedule) from [<c0c0d2c8>] (__mmc_claim_host+0xd8/0x228)
Nov 27 23:36:51 alarmpi kernel: [<c0e10c98>] (__schedule) from [<c0e113f8>] (schedule+0x60/0xdc)
Nov 27 23:36:39 alarmpi kernel: Workqueue: events_freezable mmc_rescan
Nov 27 23:36:29 alarmpi kernel: task:kworker/1:2     state:D stack:    0 pid:16533 ppid:     2 flags:0x00000020
Nov 27 23:36:15 alarmpi kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 27 23:36:03 alarmpi kernel:       Tainted: G         C        5.10.81-3-rpi-ARCH #1
Nov 27 23:35:50 alarmpi kernel: INFO: task kworker/1:2:16533 blocked for more than 368 seconds.

rakbladsvalsen avatar Nov 28 '21 06:11 rakbladsvalsen

Had the issue recently on a Raspberry Pi 3 Model B Rev 1.2 with Linux localhost 5.10.63-v7+ #1459 SMP Wed Oct 6 16:41:10 BST 2021 armv7l GNU/Linux. It might have occurred 3 times over the last months, but I didn't check what caused the issue then.

Now had it on a Raspberry Pi 4 Model B Rev 1.1 with Linux localhost 5.10.63-v8+ #1459 SMP PREEMPT Wed Oct 6 16:42:49 BST 2021 aarch64 GNU/Linux:

syslog extract:

Nov 29 23:15:14 raspberrypi kernel: [ 1330.140055] INFO: task kworker/1:2:69 blocked for more than 120 seconds.
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140079]       Tainted: G         C        5.10.63-v8+ #1459
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140105] task:kworker/1:2     state:D stack:    0 pid:   69 ppid:     2 flags:0x00000028
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140161] Workqueue: events_freezable mmc_rescan
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140184] Call trace:
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140203]  __switch_to+0x110/0x168
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140218]  __schedule+0x2d0/0x818
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140231]  schedule+0x48/0x100
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140246]  __mmc_claim_host+0xbc/0x208
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140258]  mmc_get_card+0x3c/0x50
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140272]  mmc_sd_detect+0x28/0x98
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140286]  mmc_rescan+0xcc/0x3b8
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140305]  process_one_work+0x1f4/0x4d8
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140317]  worker_thread+0x50/0x480
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140332]  kthread+0x148/0x158
Nov 29 23:15:14 raspberrypi kernel: [ 1330.140345]  ret_from_fork+0x10/0x30

On the latter I was updating some packages and doing some other write-intensive stuff. Things started to go really slow and, it felt like, in bursts. The SD card is quite old and slow. On the same hardware I didn't have the issue using the same kernel, although it was armhf then.

peter-vanpoucke avatar Nov 29 '21 22:11 peter-vanpoucke

I'm experiencing this same panic -- typically during an apt cache update (but not always) and/or while a USB device is resetting (granted, the USB device disconnect typically occurs after the apt begins and is likely a watchdog timeout on the USB device -- not sure if the apt leads to the USB disconnect and the reconnect leads to the panic, or some other variation). I'm using CDC (serial) to communicate with the USB devices.

I'm experiencing this on multiple Raspberry Pi 3 Model B Rev 1.2 units using industrial SD cards from ATP. More load makes it reproduce more quickly, but neither load nor memory seems to be the actual root cause.

It seems to happen more often on kernel 4.19.66-v7+ and less often on 5.10.63-v7+, but it is still reproducible on both. In my case it typically occurs during apt events, usually when just updating the package cache from an automated process and not even installing packages.

Additionally, I have one or more USB devices (ATSAMD51 MCU) connected, and sometimes they will disconnect and reconnect, which also appears to either trigger the panic or be a side-effect of it; this usually occurs after the apt begins (likely triggered by a watchdog on the ATSAMD51, but this is something I'm validating now via an external USB sniffer to see what's occurring). It feels like some type of race condition, and I can speed it up by increasing communication on the USB bus and CPU load. No swap is used, and it never seems to be a memory issue, as there is often 100MB+ of free memory.

This occurs with or without the suggested change:

vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
syslog
2021-11-29T00:44:00.277108+00:00 testbox kernel: [152349.575566] usb 1-1.2: USB disconnect, device number 8
2021-11-29T00:44:00.575265+00:00 testbox kernel: [152349.871024] usb 1-1.2: new full-speed USB device number 10 using dwc_otg
2021-11-29T00:44:00.650556+00:00 testbox kernel: [152350.015909] usb 1-1.2: New USB device found, idVendor=6f70, idProduct=0004, bcdDevice= 1.00
2021-11-29T00:44:00.650619+00:00 testbox kernel: [152350.015926] usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
2021-11-29T00:44:00.650624+00:00 testbox kernel: [152350.015934] usb 1-1.2: Product: Test 4 Port Board
2021-11-29T00:44:00.650628+00:00 testbox kernel: [152350.015941] usb 1-1.2: Manufacturer: Test
2021-11-29T00:44:00.650662+00:00 testbox kernel: [152350.015949] usb 1-1.2: SerialNumber: B90C448D5335353334202020FF070A33
2021-11-29T00:44:00.650668+00:00 testbox kernel: [152350.016894] cdc_acm 1-1.2:1.0: ttyACM0: USB ACM device
2021-11-29T00:44:00.650674+00:00 testbox kernel: [152350.019280] usb 1-1.3: USB disconnect, device number 9
2021-11-29T00:44:00.960531+00:00 testbox kernel: [152350.320911] usb 1-1.3: new full-speed USB device number 11 using dwc_otg
2021-11-29T00:44:01.110574+00:00 testbox kernel: [152350.473538] usb 1-1.3: New USB device found, idVendor=6f70, idProduct=0005, bcdDevice= 1.00
2021-11-29T00:44:01.110662+00:00 testbox kernel: [152350.473555] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
2021-11-29T00:44:01.110669+00:00 testbox kernel: [152350.473563] usb 1-1.3: Product: Test 8 Port Board
2021-11-29T00:44:01.110672+00:00 testbox kernel: [152350.473570] usb 1-1.3: Manufacturer: Test
2021-11-29T00:44:01.110676+00:00 testbox kernel: [152350.473578] usb 1-1.3: SerialNumber: 64E71E0F5334574347202020FF072442
2021-11-29T00:44:01.110679+00:00 testbox kernel: [152350.475746] cdc_acm 1-1.3:1.0: ttyACM1: USB ACM device
2021-11-29T00:46:23.420115+00:00 testbox kernel: [152492.531150] INFO: task kworker/1:2:28942 blocked for more than 122 seconds.
2021-11-29T00:46:24.544424+00:00 testbox kernel: [152492.531165]       Tainted: G         C        5.10.63-v7+ #1459
2021-11-29T00:46:24.544453+00:00 testbox kernel: [152492.531171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2021-11-29T00:46:24.544460+00:00 testbox kernel: [152492.531179] task:kworker/1:2     state:D stack:    0 pid:28942 ppid:     2 flags:0x00000000
2021-11-29T00:46:24.544466+00:00 testbox kernel: [152492.531206] Workqueue: events_freezable mmc_rescan
2021-11-29T00:46:24.544471+00:00 testbox kernel: [152492.531218] Backtrace:
2021-11-29T00:46:24.544476+00:00 testbox kernel: [152492.531239] [<809f9df0>] (__schedule) from [<809fa7c8>] (schedule+0x68/0xe4)
2021-11-29T00:46:24.544482+00:00 testbox kernel: [152492.531251]  r10:81e91800 r9:ffffe000 r8:00000000 r7:00000000 r6:60000013 r5:8a86ec80
2021-11-29T00:46:24.544487+00:00 testbox kernel: [152492.531258]  r4:ffffe000
2021-11-29T00:46:24.544492+00:00 testbox kernel: [152492.531270] [<809fa760>] (schedule) from [<80806328>] (__mmc_claim_host+0xe0/0x238)
2021-11-29T00:46:24.544498+00:00 testbox kernel: [152492.531278]  r5:81e91a18 r4:00000002
2021-11-29T00:46:24.544503+00:00 testbox kernel: [152492.531289] [<80806248>] (__mmc_claim_host) from [<808064b8>] (mmc_get_card+0x38/0x3c)
2021-11-29T00:46:24.544509+00:00 testbox kernel: [152492.531300]  r10:00000000 r9:00000000 r8:00000040 r7:baae2d00 r6:81e91a18 r5:00000000
2021-11-29T00:46:24.544515+00:00 testbox kernel: [152492.531307]  r4:81e92800
2021-11-29T00:46:24.544520+00:00 testbox kernel: [152492.531320] [<80806480>] (mmc_get_card) from [<80810144>] (mmc_sd_detect+0x24/0x7c)
2021-11-29T00:46:24.544526+00:00 testbox kernel: [152492.531328]  r5:81e91800 r4:81e91800
2021-11-29T00:46:24.544531+00:00 testbox kernel: [152492.531340] [<80810120>] (mmc_sd_detect) from [<80808d1c>] (mmc_rescan+0xdc/0x3b0)
2021-11-29T00:46:24.544536+00:00 testbox kernel: [152492.531348]  r5:81e91800 r4:81e91a7c
2021-11-29T00:46:24.544541+00:00 testbox kernel: [152492.531362] [<80808c40>] (mmc_rescan) from [<8013b900>] (process_one_work+0x250/0x5a0)
2021-11-29T00:46:24.544547+00:00 testbox kernel: [152492.531372]  r9:00000000 r8:00000040 r7:baae2d00 r6:baadf6c0 r5:84a8e100 r4:81e91a7c
2021-11-29T00:46:24.544552+00:00 testbox kernel: [152492.531384] [<8013b6b0>] (process_one_work) from [<8013bcb0>] (worker_thread+0x60/0x5c4)
2021-11-29T00:46:24.544557+00:00 testbox kernel: [152492.531395]  r10:baadf6c0 r9:80f03d00 r8:baadf6d8 r7:00000008 r6:baadf6c0 r5:84a8e114
2021-11-29T00:46:24.544572+00:00 testbox kernel: [152492.531401]  r4:84a8e100
2021-11-29T00:46:24.544578+00:00 testbox kernel: [152492.531413] [<8013bc50>] (worker_thread) from [<80143790>] (kthread+0x170/0x174)
2021-11-29T00:46:24.544584+00:00 testbox kernel: [152492.531424]  r10:8ab97e74 r9:84a8e100 r8:8013bc50 r7:87922000 r6:00000000 r5:882b0880
2021-11-29T00:46:24.544589+00:00 testbox kernel: [152492.531431]  r4:82718bc0
2021-11-29T00:46:24.544595+00:00 testbox kernel: [152492.531442] [<80143620>] (kthread) from [<801000ec>] (ret_from_fork+0x14/0x28)
2021-11-29T00:46:24.544600+00:00 testbox kernel: [152492.531449] Exception stack(0x87923fb0 to 0x87923ff8)
2021-11-29T00:46:24.544606+00:00 testbox kernel: [152492.531458] 3fa0:                                     00000000 00000000 00000000 00000000
2021-11-29T00:46:24.544611+00:00 testbox kernel: [152492.531468] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
2021-11-29T00:46:24.544621+00:00 testbox kernel: [152492.531476] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
2021-11-29T00:46:24.544627+00:00 testbox kernel: [152492.531487]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:80143620
2021-11-29T00:46:24.544633+00:00 testbox kernel: [152492.531494]  r4:882b0880
2021-11-29T00:48:26.338733+00:00 testbox kernel: [152615.411325] INFO: task kworker/1:2:28942 blocked for more than 245 seconds.
2021-11-29T00:48:32.226785+00:00 testbox kernel: [152615.411349]       Tainted: G         C        5.10.63-v7+ #1459
2021-11-29T00:48:35.222727+00:00 testbox kernel: [152615.411355] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2021-11-29T00:48:38.237021+00:00 testbox kernel: [152615.411363] task:kworker/1:2     state:D stack:    0 pid:28942 ppid:     2 flags:0x00000000
2021-11-29T00:31:09.941737+00:00 testbox systemd-modules-load[124]: Inserted module 'i2c_dev'
2021-11-29T00:31:09.950765+00:00 testbox systemd-fsck[113]: e2fsck 1.43.4 (31-Jan-2017)
2021-11-29T00:31:09.950873+00:00 testbox systemd-fsck[113]: /dev/mmcblk0p2: clean, 208768/427392 files, 1170382/1706752 blocks
2021-11-29T00:31:09.950930+00:00 testbox systemd[1]: Started Remount Root and Kernel File Systems.
2021-11-29T00:31:09.942737+00:00 testbox kernel: [    0.000000] Booting Linux on physical CPU 0x0
2021-11-29T00:31:09.951018+00:00 testbox kernel: [    0.000000] Linux version 5.10.63-v7+ (dom@buildbot) (arm-linux-gnueabihf-gcc-8 (Ubuntu/Linaro 8.4.0-3ubuntu1) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1459 SMP Wed Oct 6 16:41:10 BST 2021

samyk avatar Nov 29 '21 23:11 samyk

Just wanted to add that I was seeing this on a Raspberry Pi 4 2GB running the 64-bit Ubuntu 20.04 image. I was running the Unifi Network Java app in a container and had this happen quite regularly. No amount of tuning the kernel or JVM seemed to fix it. I eventually removed the Unifi container altogether and the issue mostly stopped, although I was still getting occasional lockups until I finally got rid of all of my running LXC/LXD containers. I haven't hit this issue since abandoning them.

It looks like this problem:

  • Isn't specific to a Pi. Both 3s and 4s seem impacted
  • Happens to both 32 and 64 bit kernels
  • Isn't specific to the Raspberry Pi OS kernel since it happens to the Ubuntu one too (although they may apply similar patches)

I am somewhat curious if others are getting this when running containers and if the issue stops when you turn them off.

kmark avatar Nov 29 '21 23:11 kmark