Kernel panic during reboot with PCIe-Ethernet card
Describe the bug
I'm using a CM4 on the official I/O board with Microchip's official LAN7430 evaluation board. The chip is intended for use on a custom I/O board. During preliminary testing over many reboots, a kernel panic occurred during boot roughly once every 1,500 reboots, while the LAN7430 was being accessed. The error can be provoked much faster during normal operation by changing the interface's IP address in a loop while the link is active:
sudo bash -c "while true; do ifconfig eth1 169.254.237.236; ifconfig eth1 169.254.237.237; ifconfig eth1 169.254.237.238; done"
However, the error only occurs once I²C has been enabled on the CM4 at least once (dtparam=i2c_vc=on in the config.txt file). It does not occur before that, and it persists even after I²C is disabled again.
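The reproduction loop can also be run in a bounded form that reports progress, which makes it easier to see on a serial console how many rounds completed before the panic. This is only a sketch: the function name cycle_addresses and the IFCONFIG_CMD dry-run override are hypothetical and not part of the original report; the interface name and addresses are the ones used above.

```shell
# Sketch only: cycle the three test addresses a fixed number of times.
# IFCONFIG_CMD is a hypothetical override for dry runs; it defaults to ifconfig.
cycle_addresses() {
    ca_iface="$1"
    ca_rounds="$2"
    ca_cmd="${IFCONFIG_CMD:-ifconfig}"
    ca_i=1
    while [ "$ca_i" -le "$ca_rounds" ]; do
        for ca_addr in 169.254.237.236 169.254.237.237 169.254.237.238; do
            # Stop at the first failing call so the failing round is visible.
            "$ca_cmd" "$ca_iface" "$ca_addr" || return 1
        done
        echo "round $ca_i done"
        ca_i=$((ca_i + 1))
    done
}
```

Run as root with, for example, cycle_addresses eth1 1000. Setting IFCONFIG_CMD=echo prints the calls instead of executing them.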
With support from Microchip, we found a workaround, although it does not solve the underlying problem. Disabling the ASPM L1 power-saving mode (l1_aspm) avoids the error. This can be done with the following command:
sudo sh -c "echo 0 > /sys/bus/pci/devices/0000:01:00.0/link/l1_aspm"
One suspicion from the support team was that the host controller does not support this power-saving mode, or at least has issues with it.
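For repeated testing, the workaround can be wrapped in a small helper that checks the sysfs attribute before writing and shows the value before and after. This is a minimal sketch: the function name disable_l1_aspm is hypothetical, and the default device path is simply the one from the command above. A sysfs write like this does not survive a reboot, so it has to be reapplied after each boot.

```shell
# Sketch only: disable ASPM L1 for one PCI device through sysfs.
# The default path is the LAN7430 device path used in this report.
disable_l1_aspm() {
    aspm_dev_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    aspm_attr="$aspm_dev_dir/link/l1_aspm"
    if [ ! -w "$aspm_attr" ]; then
        echo "cannot write $aspm_attr (attribute missing, or not running as root?)" >&2
        return 1
    fi
    echo "before: $(cat "$aspm_attr")"
    echo 0 > "$aspm_attr"
    echo "after: $(cat "$aspm_attr")"
}
```

Run as root: disable_l1_aspm, or pass a different device directory as the first argument.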
Steps to reproduce the behaviour
To retest everything, I used a new Raspberry Pi CM4 (4 GB RAM, 16 GB flash, no WiFi/BT) and installed the current official image (Raspberry Pi OS Lite, 64-bit).
Here's my procedure:
• Flashed the current image onto the CM4 via USB.
• Booted with the I/O board, the LAN7430 eval board, and an HDMI monitor, and set up the user and keyboard layout.
• Ran the test with the following command: sudo bash -c "while true; do ifconfig eth1 169.254.237.236; ifconfig eth1 169.254.237.237; ifconfig eth1 169.254.237.238; done". No problem was apparent; the test ran for several minutes.
• Enabled the serial interface by adding the line "enable_uart=1" at the end of /boot/config.txt, then rebooted with a power cycle.
• Ran the test again; still no problem detected. The test ran for several minutes.
• Enabled I²C by adding the line "dtparam=i2c_vc=on" at the end of /boot/config.txt, then rebooted with a power cycle.
• Ran the test again; this time the kernel panic occurred after a few seconds.
• Power cycled and deleted both settings (I²C and serial console) from the file.
• Ran the test again; from then on the kernel panic occurred consistently, even after completely reinstalling the image on the CM4.
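The config.txt changes in the procedure above can be scripted so that each retest applies and removes exactly the same lines. The following is a sketch, not the procedure actually used: the helper names add_config_line and remove_config_line are hypothetical, and the file path is passed in explicitly so the helpers can be tried on a copy first.

```shell
# Sketch only: append a line to a config file unless it is already present.
add_config_line() {
    cfg_file="$1"
    cfg_line="$2"
    grep -qxF -- "$cfg_line" "$cfg_file" 2> /dev/null ||
        printf '%s\n' "$cfg_line" >> "$cfg_file"
}

# Sketch only: remove every exact occurrence of a line from a config file.
remove_config_line() {
    cfg_file="$1"
    cfg_line="$2"
    cfg_tmp=$(mktemp) || return 1
    # grep exits with 1 when no lines remain; only status 2 or higher is an error.
    grep -vxF -- "$cfg_line" "$cfg_file" > "$cfg_tmp" || [ "$?" -eq 1 ] || {
        rm -f "$cfg_tmp"
        return 1
    }
    cat "$cfg_tmp" > "$cfg_file"
    rm -f "$cfg_tmp"
}
```

For the steps in this report that would be, as root, add_config_line /boot/config.txt "dtparam=i2c_vc=on" followed by a power cycle, and later remove_config_line with the same arguments.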
Device(s)
Raspberry Pi CM4
System
Which OS and version: Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 2acf7afcba7d11500313a7b93bb55a2aae20b2d6, stage2
Which firmware version: Oct 17 2023 15:39:16 Copyright (c) 2012 Broadcom version 30f0c5e4d076da3ab4f341d88e7d505760b93ad7 (clean) (release) (start)
Which kernel version: Linux raspberrypi 6.1.0-rpi7-rpi-v8 #1 SMP PREEMPT Debian 1:6.1.63-1+rpt1 (2023-11-24) aarch64 GNU/Linux
Logs
pi@raspberrypi:~$ sudo bash -c "while true; do ifconfig eth1 169.254.237.236; ifconfig eth1 169.254.237.237; ifconfig eth1 169.254.237.238; done"
[ 75.123576] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
[ 75.123592] CPU: 1 PID: 469 Comm: avahi-daemon Tainted: G C 6.1.0-rpi7-rpi-v8 #1 Debian 1:6.1.63-1+rpt1
[ 75.123599] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 75.123601] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 75.123604] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 75.123606] pc : el1_interrupt+0x20/0x70
[ 75.123621] lr : el1h_64_irq_handler+0x18/0x2c
[ 75.123626] sp : ffffffc00896b8f0
[ 75.123628] x29: ffffffc00896b8f0 x28: ffffff8105283e00 x27: 0000000000000000
[ 75.123636] x26: ffffff810390c288 x25: 000000000000042c x24: 0000000000000005
[ 75.123642] x23: 0000000060000005 x22: ffffffe7c2eee124 x21: ffffffc00896ba70
[ 75.123648] x20: ffffffe8312100d0 x19: ffffffc00896b920 x18: 0000000000000000
[ 75.123653] x17: 0000000000000000 x16: ffffffe83133cf24 x15: 0000007fd2ba0e68
[ 75.123658] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 75.123663] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffe83133cf78
[ 75.123667] x8 : ffffffc00896bb38 x7 : 0000000000000000 x6 : 00000000f43158e3
[ 75.123672] x5 : 00ffffffffffffff x4 : 00284af0d710dfc0 x3 : 0000000000000018
[ 75.123677] x2 : 0000000000002c20 x1 : 00000000000000c0 x0 : ffffffc00896b920
[ 75.123683] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 75.123686] CPU: 1 PID: 469 Comm: avahi-daemon Tainted: G C 6.1.0-rpi7-rpi-v8 #1 Debian 1:6.1.63-1+rpt1
[ 75.123691] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 75.123693] Call trace:
[ 75.123694] dump_backtrace.part.0+0xec/0x100
[ 75.123700] show_stack+0x20/0x30
[ 75.123704] dump_stack_lvl+0x88/0xb4
[ 75.123708] dump_stack+0x18/0x34
[ 75.123712] panic+0x1a0/0x370
[ 75.123719] nmi_panic+0xb4/0xbc
[ 75.123725] arm64_serror_panic+0x78/0x84
[ 75.123729] is_valid_bugaddr+0x0/0x30
[ 75.123733] el1h_64_error_handler+0x38/0x50
[ 75.123737] el1h_64_error+0x64/0x68
[ 75.123741] el1_interrupt+0x20/0x70
[ 75.123745] el1h_64_irq_handler+0x18/0x2c
[ 75.123750] el1h_64_irq+0x64/0x68
[ 75.123753] lan743x_csr_wait_for_bit_atomic.constprop.0+0x50/0xb0 [lan743x]
[ 75.123772] lan743x_netdev_set_multicast+0x17c/0x270 [lan743x]
[ 75.123782] __dev_set_rx_mode+0x6c/0xac
[ 75.123788] __dev_mc_add+0x98/0xbc
[ 75.123791] dev_mc_add+0x1c/0x30
[ 75.123795] igmp_group_added+0x194/0x210
[ 75.123801] ____ip_mc_inc_group+0x194/0x2a0
[ 75.123807] __ip_mc_join_group+0x110/0x180
[ 75.123812] ip_mc_join_group+0x1c/0x30
[ 75.123817] do_ip_setsockopt+0x1090/0x11a4
[ 75.123823] ip_setsockopt+0x3c/0xac
[ 75.123828] udp_setsockopt+0x24/0x4c
[ 75.123834] sock_common_setsockopt+0x24/0x30
[ 75.123840] __sys_setsockopt+0xe8/0x1e0
[ 75.123844] __arm64_sys_setsockopt+0x30/0x40
[ 75.123848] invoke_syscall+0x50/0x120
[ 75.123854] el0_svc_common.constprop.0+0x68/0x124
[ 75.123860] do_el0_svc+0x34/0xd0
[ 75.123865] el0_svc+0x30/0x94
[ 75.123870] el0t_64_sync_handler+0xf4/0x120
[ 75.123874] el0t_64_sync+0x18c/0x190
[ 75.123880] SMP: stopping secondary CPUs
[ 75.130335] Kernel Offset: 0x2829200000 from 0xffffffc008000000
[ 75.130337] PHYS_OFFSET: 0x0
[ 75.130339] CPU features: 0x80000,2013c080,0000421b
[ 75.130342] Memory Limit: none
Additional context
No response
I hope this is the right place to report the error. If I should raise it with a different manufacturer or somewhere else instead, I can do that as well.
As special hardware (in this case, the Microchip Eval Board with the PCIe2Ethernet chip) is required for reproducing the error, I could also offer to send you the necessary hardware for this purpose to expedite the process if possible.
Please update to the latest kernel in apt (6.6.y) and retest.
The error persists, even with the current kernel 6.6.20.
pi@raspberrypi:/$ uname -a
Linux raspberrypi 6.6.20+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.20-1+rpt1 (2024-03-07) aarch64 GNU/Linux
pi@raspberrypi:/$ sudo bash -c "while true; do ifconfig eth1 169.254.237.236; ifconfig eth1 169.254.237.237; ifconfig eth1 169.254.237.238; done"
[ 1055.724841] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
[ 1055.724854] CPU: 1 PID: 8413 Comm: ifconfig Tainted: G C 6.6.20+rpt-rpi-v8 #1 Debian 1:6.6.20-1+rpt1
[ 1055.724861] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 1055.724863] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1055.724868] pc : el1_interrupt+0x20/0x68
[ 1055.724880] lr : el1h_64_irq_handler+0x18/0x28
[ 1055.724886] sp : ffffffc082f938a0
[ 1055.724888] x29: ffffffc082f938a0 x28: ffffff8040355c40 x27: 0000000000000000
[ 1055.724898] x26: ffffff8041318290 x25: 0000000000000434 x24: 0000000000000006
[ 1055.724904] x23: 0000000060000005 x22: ffffffe6e730b848 x21: ffffffc082f93a20
[ 1055.724911] x20: ffffffe75a2100e0 x19: ffffffc082f938d0 x18: 0000000000000000
[ 1055.724916] x17: 0000000000000000 x16: ffffffe75aecbfc0 x15: 0000007fd5050a30
[ 1055.724921] x14: 0000000000000000 x13: 0000000000000020 x12: 0101010101010101
[ 1055.724926] x11: 7f7f7f7f7f7f7f7f x10: ffffffbfbf1bc484 x9 : ffffffe6e730bcdc
[ 1055.724932] x8 : ffffffc082f93ae8 x7 : 0000000000000000 x6 : 0000000080808080
[ 1055.724938] x5 : 0000000000000000 x4 : 0000000000000008 x3 : 0000000000000030
[ 1055.724943] x2 : ffffffc080492428 x1 : 00000000000000c0 x0 : ffffffc082f938d0
[ 1055.724950] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 1055.724953] CPU: 1 PID: 8413 Comm: ifconfig Tainted: G C 6.6.20+rpt-rpi-v8 #1 Debian 1:6.6.20-1+rpt1
[ 1055.724957] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 1055.724960] Call trace:
[ 1055.724962] dump_backtrace+0xa0/0x100
[ 1055.724969] show_stack+0x20/0x38
[ 1055.724972] dump_stack_lvl+0x48/0x60
[ 1055.724978] dump_stack+0x18/0x28
[ 1055.724981] panic+0x328/0x390
[ 1055.724988] nmi_panic+0x94/0xa0
[ 1055.724993] arm64_serror_panic+0x78/0x90
[ 1055.724997] do_serror+0x44/0x88
[ 1055.725001] el1h_64_error_handler+0x30/0x48
[ 1055.725006] el1h_64_error+0x64/0x68
[ 1055.725009] el1_interrupt+0x20/0x68
[ 1055.725013] el1h_64_irq_handler+0x18/0x28
[ 1055.725017] el1h_64_irq+0x64/0x68
[ 1055.725020] lan743x_csr_wait_for_bit_atomic.constprop.0+0x50/0xb0 [lan743x]
[ 1055.725039] lan743x_netdev_set_multicast+0x17c/0x278 [lan743x]
[ 1055.725050] __dev_set_rx_mode+0x70/0xb8
[ 1055.725058] __dev_change_flags+0xac/0x218
[ 1055.725064] dev_change_flags+0x2c/0x80
[ 1055.725071] devinet_ioctl+0x3a4/0x6b8
[ 1055.725078] inet_ioctl+0x21c/0x238
[ 1055.725084] sock_do_ioctl+0x64/0x120
[ 1055.725088] sock_ioctl+0x120/0x390
[ 1055.725092] __arm64_sys_ioctl+0xb4/0x100
[ 1055.725098] invoke_syscall+0x50/0x128
[ 1055.725105] el0_svc_common.constprop.0+0x48/0xf0
[ 1055.725111] do_el0_svc+0x24/0x38
[ 1055.725116] el0_svc+0x40/0xe8
[ 1055.725121] el0t_64_sync_handler+0x100/0x130
[ 1055.725126] el0t_64_sync+0x190/0x198
[ 1055.725130] SMP: stopping secondary CPUs
[ 1055.725135] Kernel Offset: 0x26da200000 from 0xffffffc080000000
[ 1055.725138] PHYS_OFFSET: 0x0
[ 1055.725139] CPU features: 0x0,80000201,3c020000,0000421b
[ 1055.725143] Memory Limit: none
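When comparing several panics, it can help to pull only the frames that belong to a loadable module out of a saved log, since those carry the module name in brackets (here [lan743x]). A minimal sketch, assuming the log has one entry per line as printed above; the function name module_frames is hypothetical.

```shell
# Sketch only: print call-trace frames that carry a "[module]" suffix.
# Reads a kernel log on stdin; prints "symbol+offset/size [module]" per match.
module_frames() {
    awk '$NF ~ /^\[[a-z0-9_]+\]$/ && $(NF-1) ~ /\+0x/ { print $(NF-1), $NF }'
}
```

Usage: module_frames < panic.log, where panic.log is a saved copy of the console output.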
Maybe this thread would help? https://github.com/raspberrypi/linux/issues/5659 If I get a chance and I understand the problem correctly, I hope to try it tonight. I am getting a kernel panic on just about every other reboot, or maybe even 3 out of 4 reboots.