Automatic restart of MCP2518FD canbus causes oops
Describe the bug
Setting restart-ms to a non zero value (100ms is the one I've been using) results in an occasional BUG about echo SKBs, and slightly less commonly an OOPS
Steps to reproduce the behaviour
- Set up a raspberry Pi 4 with the Waveshare 2-channel CAN FD hat SKU 17075 (the one with the MCP2518FD). I installed the latest Raspberry Pi OS Lite 64 and did an
apt updateandapt upgrade. - Connect one of the two CAN buses to something that will acknowledge your frames (I just connected both of them together but I have verified this works with a PCAN-USB as well).
- install this udev rules file:
KERNELS=="spi0.0", SUBSYSTEMS=="spi", DRIVERS=="mcp251xfd", ACTION=="add|bind|change", NAME="canbus0", ATTR{tx_queue_len}="1024"
KERNELS=="spi0.1", SUBSYSTEMS=="spi", DRIVERS=="mcp251xfd", ACTION=="add|bind|change", NAME="canbus1", ATTR{tx_queue_len}="1024"
KERNELS=="spi1.0", SUBSYSTEMS=="spi", DRIVERS=="mcp251xfd", ACTION=="add|bind|change", NAME="canbus2", ATTR{tx_queue_len}="1024"
KERNELS=="spi1.1", SUBSYSTEMS=="spi", DRIVERS=="mcp251xfd", ACTION=="add|bind|change", NAME="canbus3", ATTR{tx_queue_len}="1024"
- Disable network manager, enable systemd-networkd, and add this .network file to
/etc/systemd/network:
[Match]
Name=canbus*
[CAN]
BitRate=500K
BusErrorReporting=yes
RestartSec=500ms
[Link]
RequiredForOnline=no
- Add these lines to
config.txt:
dtparam=spi=on
dtoverlay=spi1-3cs
dtoverlay=mcp251xfd,spi0-0,interrupt=25
dtoverlay=mcp251xfd,spi1-0,interrupt=24
- install can-utils
- reboot.
- verify that
canbus0andcanbus2are up (if your jumpers are set differently, you might have 0 and 1 instead). - Assuming
canbus0has a partner that will acknowledge, startcangen -I 123 -g 1 canbus0(it also works on canbus2 if it has a partner). Verify that traffic is flowing via counters orcandump. - Briefly short the CAN pair together with a wire, screwdriver, etc, while watching
journalctl -f. - Repeat step 10 until you see an OOPS. It should not take many attempts (usually takes less than 30 seconds)
Device (s)
Raspberry Pi 4 Mod. B
System
Logs
Included in raspinfo
Additional context
This is following up from an email conversation between myself and @marckleinebudde . I thought it best to move it somewhere we can track it rather than continue sending many emails.
I'm reporting this in raspberrypi/linux because I've reproduced it on multiple distros but don't have another SPI master to try it with to see if it's not Pi-specific.
I have reproduced this issue with 6.6 kernels and 6.12 kernels, on both yocto and raspberry pi OS, with the waveshare hat as well as our custom board, with CM4 and Pi 4 model B, etc.
I tried cherry-picking some later commits into the 6.12 kernel and it doesn't really help.
Marc suggested this patch which did not seem to cure the issue:
--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
@@ -759,6 +759,9 @@ static void mcp251xfd_chip_stop(struct mcp251xfd_priv *priv,
{
priv->can.state = state;
+ hrtimer_cancel(&priv->rx_irq_timer);
+ hrtimer_cancel(&priv->tx_irq_timer);
+ cancel_work_sync(&priv->tx_work);
mcp251xfd_chip_interrupts_disable(priv);
mcp251xfd_chip_rx_int_disable(priv);
mcp251xfd_timestamp_stop(priv);
He also suggested this patch which I have yet to try but will try as soon as I submit this (been spending time verifying that it's not particular to our distro or hardware):
--- a/drivers/net/can/dev/dev.c
+++ b/drivers/net/can/dev/dev.c
@@ -185,7 +185,9 @@ static void can_restart_work(struct work_struct *work)
struct can_priv *priv = container_of(dwork, struct can_priv,
restart_work);
+ netif_tx_lock(priv->dev);
can_restart(priv->dev);
+ netif_tx_unlock(priv->dev);
}
int can_restart_now(struct net_device *dev)
Can confirm that second patch does not work, I get kernel: BUG: scheduling while atomic: kworker/3:0/36/0x00000002
In a mail you mentioned, that you have connected canbus2 and canbus3 and are running cangen -I 123 -g 0 canbus2 and cangen -I 123 -g 0 canbus3 simultaneously.
You are not allowed to send the same CAN-ID (here 123) from 2 different CAN controllers on the same bus (at the same time). This will lead to a bus-off. Never the less, a bus-off should not cause the Linux kernel to throw any errors or OOPS.
@marckleinebudde Thank you for pointing this out. I adjusted this to use different CAN IDs from each CAN controller a while ago just to make the dumps easier to see. It did not end up making a difference, and in fact you only need to transmit on one of the CAN interfaces to trigger this bug.
If I remove the restart-ms parameter and instead manually bring the interface down and back up, it recovers just fine. I have not tried ip link set canbus0 type can restart to see if that crashes however. I could do that.
Tried manual restarting, wasn't able to cause an oops. Oddly enough, when I re-enabled automatic restart, I still wasn't able to cause an oops (did get the echo_skb errors however). At first I thought it was some interaction with the ethernet driver (since I had switched from SSH to serial console) but now it's reliably crashing either way.
I will collect a dump now.
Edit: tried to collect a dump with Marc's instructions, no file was created. Maybe it needs to be a debug kernel?
Edit 2: Marc mentioned that this particular failure wouldn't result in a dump, my bad for assuming
@marckleinebudde We are seeing this on 6.6.64 with PREEMPT_RT patches and cherry picked https://github.com/raspberrypi/linux/commit/50d66c86262f31d0889971498bbeefa5d9955b10
From a user's bug report:
[ 3002.009750] mcp251xfd spi2.1 can_s1: bus-off, scheduling restart in 1000 ms
[ 3003.048006] mcp251xfd spi2.1 can_s1: can_put_echo_skb: BUG! echo_skb 2 is occupied!
[ 3003.048071] mcp251xfd spi2.1 can_s1: can_put_echo_skb: BUG! echo_skb 3 is occupied!
[ 3003.048639] ------------[ cut here ]------------
[ 3003.048643] kernel BUG at lib/dynamic_queue_limits.c:27!
[ 3003.048647] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT_RT SMP
[ 3003.048651] Modules linked in: can_raw i2c_dev can_heartbeat cdc_acm xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter br_netfilter bridge usb_f_rndis usb_f_ecm u_ether libcomposite 8021q garp stp llc overlay brcmfmac_wcc vc4 snd_soc_hdmi_codec drm_display_helper cec drm_dma_helper aes_ce_blk aes_ce_cipher drm_kms_helper ghash_ce brcmfmac gf128mul libaes snd_soc_core brcmutil sha2_ce snd_compress spidev sha256_arm64 snd_pcm_dmaengine cfg80211 rpivid_hevc(C) sha1_ce raspberrypi_hwmon snd_pcm pisp_be snd_timer v4l2_mem2mem spi_bcm2835 rfkill snd videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 gpio_keys videodev videobuf2_common mc dwc2 raspberrypi_gpiomem rp1_pio i2c_designware_platform i2c_designware_core rp1_mailbox rp1_adc rp1 spi_dw_mmio spi_dw nvmem_rmem uio_pdrv_genirq uio sch_fq_codel fuse drm drm_panel_orientation_quirks backlight ip_tables x_tables ipv6
[ 3003.048721] CPU: 2 PID: 915 Comm: irq/168-spi2.1 Tainted: G C 6.6.64-rt47-v8-16k #1
[ 3003.048724] Hardware name: Raspberry Pi Compute Module 5 Rev 1.0 (DT)
[ 3003.048725] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3003.048728] pc : dql_completed+0x158/0x1b0
[ 3003.048735] lr : mcp251xfd_handle_tefif+0x3ac/0x430
[ 3003.048738] sp : ffffffc08125bbd0
[ 3003.048739] x29: ffffffc08125bcb0 x28: ffffffc08125bc00 x27: 0000000000000002
[ 3003.048743] x26: ffffff800af71840 x25: ffffff800af70b70 x24: 0000000000000022
[ 3003.048745] x23: 0000000000000002 x22: 0000000000000000 x21: ffffff800af70000
[ 3003.048748] x20: ffffff80089adc00 x19: ffffff800af70a00 x18: ffffffc082c43a48
[ 3003.048751] x17: ffffffb07a42d000 x16: ffffffc080010000 x15: 0000007f8246d000
[ 3003.048754] x14: 000000000000000f x13: 1ffffff001a7a8a1 x12: ffffffc082c43a48
[ 3003.048756] x11: 0000000000000169 x10: 0000000000000194 x9 : ffffffd084d5a758
[ 3003.048759] x8 : 0000000000000022 x7 : 0000000000000000 x6 : 0000000000000022
[ 3003.048762] x5 : ffffffd0856f8000 x4 : 0000000000000004 x3 : 0000000000000000
[ 3003.048764] x2 : 0000000000000000 x1 : 0000000000000022 x0 : ffffff80089adcc0
[ 3003.048767] Call trace:
[ 3003.048768] dql_completed+0x158/0x1b0
[ 3003.048771] mcp251xfd_irq+0x410/0xd90
[ 3003.048775] irq_thread_fn+0x34/0xb8
[ 3003.048779] irq_thread+0x18c/0x260
[ 3003.048782] kthread+0x10c/0x120
[ 3003.048786] ret_from_fork+0x10/0x20
[ 3003.048791] Code: 7a401860 540001eb 2a0403e3 17ffffc3 (d4210000)
[ 3003.048793] ---[ end trace 0000000000000000 ]---
[ 3003.048795] note: irq/168-spi2.1[915] exited with irqs disabled
[ 3003.048797] note: irq/168-spi2.1[915] exited with preempt_count 1
[ 3003.048806] genirq: exiting task "irq/168-spi2.1" (915) is an active IRQ thread (irq 168)
[ 3003.050122] mcp251xfd spi2.1 can_s1: can_put_echo_skb: BUG! echo_skb 4 is occupied!
[ 3003.051047] ------------[ cut here ]------------
[ 3003.051051] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x110/0x128
[ 3003.051062] Modules linked in: can_raw i2c_dev can_heartbeat cdc_acm xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter br_netfilter bridge usb_f_rndis usb_f_ecm u_ether libcomposite 8021q garp stp llc overlay brcmfmac_wcc vc4 snd_soc_hdmi_codec drm_display_helper cec drm_dma_helper aes_ce_blk aes_ce_cipher drm_kms_helper ghash_ce brcmfmac gf128mul libaes snd_soc_core brcmutil sha2_ce snd_compress spidev sha256_arm64 snd_pcm_dmaengine cfg80211 rpivid_hevc(C) sha1_ce raspberrypi_hwmon snd_pcm pisp_be snd_timer v4l2_mem2mem spi_bcm2835 rfkill snd videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 gpio_keys videodev videobuf2_common mc dwc2 raspberrypi_gpiomem rp1_pio i2c_designware_platform i2c_designware_core rp1_mailbox rp1_adc rp1 spi_dw_mmio spi_dw nvmem_rmem uio_pdrv_genirq uio sch_fq_codel fuse drm drm_panel_orientation_quirks backlight ip_tables x_tables ipv6
[ 3003.051124] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D C 6.6.64-rt47-v8-16k #1
[ 3003.051127] Hardware name: Raspberry Pi Compute Module 5 Rev 1.0 (DT)
[ 3003.051128] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3003.051131] pc : ct_kernel_exit.constprop.0+0x110/0x128
[ 3003.051133] lr : ct_idle_enter+0x10/0x20
[ 3003.051135] sp : ffffffc0800ebdc0
[ 3003.051136] x29: ffffffc0800ebdc0 x28: 0000000000000000 x27: 0000000000000000
[ 3003.051138] x26: 0000000000000000 x25: ffffff8006340000 x24: 0000000000000000
[ 3003.051141] x23: 0000000000000000 x22: ffffffd0853286f8 x21: ffffffd0856f79d4
[ 3003.051143] x20: ffffffd0856f78b0 x19: ffffff80ff753f50 x18: 0000000000000000
[ 3003.051146] x17: 0000000000000000 x16: ffffffd08413bb30 x15: 0000007fc64bea98
[ 3003.051148] x14: 0000000000000000 x13: 0000000100000100 x12: 0000000100000080
[ 3003.051151] x11: 0000000000000196 x10: 0000000000001ae0 x9 : ffffffd0840ee868
[ 3003.051153] x8 : ffffffc0800ebcf8 x7 : 0000000000000000 x6 : 0000000000000001
[ 3003.051156] x5 : 4000000000000000 x4 : ffffffb07a42d000 x3 : ffffffc0800ebdc0
[ 3003.051158] x2 : ffffffd085326f50 x1 : ffffffd085326f50 x0 : 4000000000000002
[ 3003.051161] Call trace:
[ 3003.051162] ct_kernel_exit.constprop.0+0x110/0x128
[ 3003.051164] ct_idle_enter+0x10/0x20
[ 3003.051166] default_idle_call+0x44/0x180
[ 3003.051169] do_idle+0x1d8/0x218
[ 3003.051173] cpu_startup_entry+0x3c/0x50
[ 3003.051174] secondary_start_kernel+0x138/0x160
[ 3003.051178] __secondary_switched+0xb8/0xc0
[ 3003.051181] ---[ end trace 0000000000000000 ]---
[ 3003.052632] mcp251xfd spi2.1 can_s1: can_put_echo_skb: BUG! echo_skb 5 is occupied!
[ 3003.052671] mcp251xfd spi2.1 can_s1: can_put_echo_skb: BUG! echo_skb 6 is occupied!
[ 3003.499085] irq 168: nobody cared (try booting with the "irqpoll" option)
[ 3003.499090] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D WC 6.6.64-rt47-v8-16k #1
[ 3003.499094] Hardware name: Raspberry Pi Compute Module 5 Rev 1.0 (DT)
[ 3003.499095] Call trace:
[ 3003.499096] dump_backtrace+0xa0/0x100
[ 3003.499101] show_stack+0x20/0x38
[ 3003.499103] dump_stack_lvl+0x60/0xb0
[ 3003.499108] dump_stack+0x18/0x28
[ 3003.499110] __report_bad_irq+0x40/0xf8
[ 3003.499114] note_interrupt+0x330/0x398
[ 3003.499116] handle_irq_event+0xb4/0xd8
[ 3003.499118] handle_level_irq+0xc0/0x1d0
[ 3003.499121] generic_handle_irq+0x38/0x58
[ 3003.499123] rp1_gpio_irq_handler+0x138/0x208
[ 3003.499126] generic_handle_irq+0x38/0x58
[ 3003.499127] rp1_chained_handle_irq+0x80/0x128
[ 3003.499131] generic_handle_domain_irq+0x34/0x58
[ 3003.499133] gic_handle_irq+0x4c/0xd8
[ 3003.499134] call_on_irq_stack+0x24/0x58
[ 3003.499137] do_interrupt_handler+0x88/0x98
[ 3003.499139] el1_interrupt+0x34/0x68
[ 3003.499143] el1h_64_irq_handler+0x18/0x28
[ 3003.499146] el1h_64_irq+0x64/0x68
[ 3003.499147] default_idle_call+0x70/0x180
[ 3003.499149] do_idle+0x1d8/0x218
[ 3003.499151] cpu_startup_entry+0x40/0x50
[ 3003.499153] rest_init+0xf0/0xf8
[ 3003.499155] arch_call_rest_init+0x18/0x20
[ 3003.499158] start_kernel+0x534/0x6b0
[ 3003.499160] __primary_switched+0xbc/0xd0
[ 3003.499162] handlers:
[ 3003.499163] [<00000000b7753760>] irq_default_primary_handler threaded [<0000000000864658>] mcp251xfd_irq
[ 3003.499170] Disabling IRQ #168
[ 3060.924034] CAN Heartbeat: Enabled state changed to 0
The following suggested patch seems promising - we will work on this and report back
--- a/drivers/net/can/dev/dev.c
+++ b/drivers/net/can/dev/dev.c
@@ -185,7 +185,9 @@ static void can_restart_work(struct work_struct *work)
struct can_priv *priv = container_of(dwork, struct can_priv,
restart_work);
+ netif_tx_lock(priv->dev);
can_restart(priv->dev);
+ netif_tx_unlock(priv->dev);
}
int can_restart_now(struct net_device *dev)
@rmerrill-fw has this been resolved on your end?
@bhjelstrom we're working around it by triggering restart in userspace instead of kernel
I am also facing oops on the 3 Raspberry Pi 4 with RaspberryPi Os (Bookworm, kernel 6.12.47) when there is lots of traffic . There hasn't been any progress on the issue has there? Has anybody been able to find the issue or circumvent it by using a different kernel version? If it helps I can share some more logs, as it occurs quite often in our setup.
@bhjelstrom have you had success with the patch in your last comment?
I think I have the same issue after upgrading my RPi5 up to last Raspbian with all simple default settings. Thats happens only when a lot of transfer via CAN bus
Linux dev 6.12.47+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.12.47-1+rpt1 (2025-09-16) aarch64 GNU/Linux
PRETTY_NAME="Raspbian GNU/Linux 13 (trixie)"
NAME="Raspbian GNU/Linux"
VERSION_ID="13"
VERSION="13 (trixie)"
VERSION_CODENAME=trixie
DEBIAN_VERSION_FULL=13.1
Seems like I found the problem. I have background task with broadcasting a simple CAN-bus message every 200ms. And it cause this issue when starts another one thread with ISO-TP messaging...
Here is my code which do broadcasting and bus off
broadcast_process = multiprocessing.Process(target=start_can_broadcast_manager)
broadcast_process.start()
def start_can_broadcast_manager():
"""Broadcast VIN messages every second on all CAN interfaces (without BCM)."""
vin_hex = bytes(ecu_config.get_vin(), 'ascii')
msg1 = can.Message(
arbitration_id=0x666,
data=[0x00, vin_hex[0], vin_hex[1], vin_hex[2], vin_hex[3], vin_hex[4], vin_hex[5], vin_hex[6]],
is_extended_id=False,
)
msg2 = can.Message(
arbitration_id=0x666,
data=[0x01, vin_hex[7], vin_hex[8], vin_hex[9], vin_hex[10], vin_hex[11], vin_hex[12], vin_hex[13]],
is_extended_id=False,
)
msg3 = can.Message(
arbitration_id=0x666,
data=[0x02, vin_hex[14], vin_hex[15], vin_hex[16], 0x00, 0x00, 0x00, 0x00],
is_extended_id=False,
)
can_interfaces = ["can0", "can1", "can2"]
buses = []
# Open all CAN interfaces
for iface in can_interfaces:
try:
bus = can.interface.Bus(channel=iface, bustype="socketcan")
buses.append(bus)
logger_app.logger.info(f"Opened CAN interface: {iface}")
except Exception as e:
logger_app.logger.error(f"Failed to open CAN {iface}: {e}")
try:
while True:
for bus in buses:
try:
bus.send(msg1)
bus.send(msg2)
bus.send(msg3)
except Exception as e:
logger_app.logger.warning(f"Failed to send on {bus.channel_info}: {e}")
time.sleep(1)
except KeyboardInterrupt:
logger_app.logger.info("Broadcast manager stopped by user.")
finally:
for bus in buses:
bus.shutdown()
logger_app.logger.info("All CAN interfaces closed.")
I am also facing oops on the 3 Raspberry Pi 4 with RaspberryPi Os (Bookworm, kernel 6.12.47) when there is lots of traffic . There hasn't been any progress on the issue has there? Has anybody been able to find the issue or circumvent it by using a different kernel version? If it helps I can share some more logs, as it occurs quite often in our setup.
@bhjelstrom have you had success with the patch in your last comment?
@tom-st-pm We landed on a user-space watchdog just like @rmerrill-fw