SPI overflow in CAN bus HAT communication - IRQ handler mcp251xfd_handle_tefif()
Describe the bug
Hi everyone,
I'm encountering an issue with the IRQ handler. When I simulate multiple periodic messages over the CAN bus, at a certain point, I suspect that SPI communication fails to properly manage data exchange with the two HAT modules(https://www.waveshare.com/wiki/2-CH_CAN_FD_HAT) I am using.
I updated the kernel to the latest version (see below), which seemed to improve the situation slightly. However, it only mitigated the buffer issue by delaying the occurrence of the fault rather than fully resolving it.
My setup consists of a Raspberry Pi 5 with two CAN HAT 2CH FD modules, each using the MCP251XFD chip.
To ensure the problem is related to the transmitter and not the receiver, I tested two different scenarios:
-
Using two channels of the Raspberry Pi to transmit and the remaining two channels to receive (all configured with the same parameters: bitrate, data rate, and sampling time).
-
Using two channels of the Raspberry Pi to transmit and two channels of a Vector VN1640 (with the 1057Gcap installed) to receive. Also in this case, I carefully verified the CAN configuration.
In both cases the termination are correctly verified, providing 60 ohms in the buses.
I am aware that I am stressing the module, as I am sending multiple periodic messages (128 periodic messages per CAN FD, generating a 40% bus load). The messages require some processing to calculate the internal CRC and MC in the payload, which is correctly handled by the Python code.
Based on the information provided, the issue appears to be related to overrun packets. I suspect this is the cause of the failure.
ifconfig:
can0: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 6455202 bytes 68902927 (65.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6455202 bytes 68902927 (65.7 MiB)
TX errors 0 dropped 0 overruns 8471 carrier 0 collisions 0
device interrupt 186
can1: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 13736158 bytes 146632299 (139.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 13736158 bytes 146632299 (139.8 MiB)
TX errors 0 dropped 0 overruns 11052 carrier 0 collisions 0
device interrupt 187
can2: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 13736162 bytes 146632347 (139.8 MiB)
RX errors 0 dropped 0 overruns 3 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 188
can3: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536
(UNSPEC)
RX packets 6455206 bytes 68902975 (65.7 MiB)
RX errors 0 dropped 0 overruns 1 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 189
The information I provided above is essentially the same as in the second case, where I use the Vector hardware as the receiver. One observation I have made is that after a bus failure, the overruns seem to stop or decrease. However, after a certain period (several hours), at some point, the second bus also fails.
Thanks in advance
Steps to reproduce the behaviour
I can't find a piece of code that fully replicates the bug. I wrote a program that sends periodic messages, which generates some overruns, but not as many as my main code.
I use Bluetooth serial communication to share information about the message I want to simulate, including the ID, initial payload, and CAN bus settings. This communication is not constant.
I could try running the code without using Bluetooth, but I still need to work on it.
How I Initialize the bus:
self.canDB[canName] = SocketcanBus(
channel=canName,
fd=True if isFD == 'True' else False,
receive_own_messages=False,
local_loopback=False,
can_filters=None,
ignore_rx_error_frames=False
)
To run a periodic message, I create a thread using the library function:
send_periodic(self.messageInfo[_bus][_id]['MSG_CLASS'], self.messageInfo[_bus][_id]['PERIOD'], modifier_callback=self.messageInfo[_bus][_id]['CALLBACK'], on_error=self.messageInfo[_bus][_id]['ERROR'], autostart=False)
The message object is defined as follow:
can.Message(arbitration_id=_id, data=payload, dlc=_dlc, is_extended_id=_id_extended, is_fd=fdParam, bitrate_switch=fdParam)
Device (s)
Raspberry Pi 5
System
uname -a:
Linux raspberrypi 6.6.74+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.74-1+rpt1
(2025-01-27) aarch64 GNU/Linux
modinfo can:
filename: /lib/modules/6.6.74+rpt-rpi-2712/kernel/net/can/can.ko.xz
alias: net-pf-29
author: Urs Thuermann <[email protected]>, Oliver Hartkopp
<[email protected]>
license: Dual BSD/GPL
description: Controller Area Network PF_CAN core
srcversion: 37BF79470A254916FD8C595
depends:
intree: Y
name: can
vermagic: 6.6.74+rpt-rpi-2712 SMP preempt mod_unload modversions aarch64
parm: stats_timer:enable timer for statistics (default:on) (int)
modinfo mcp251xfd:
filename:
/lib/modules/6.6.74+rpt-rpi-2712/kernel/drivers/net/can/spi/mcp251xfd/mcp251xfd.ko.xz
license: GPL v2
description: Microchip MCP251xFD Family CAN controller driver
author: Marc Kleine-Budde <[email protected]>
srcversion: 672D5B0159EB5AA465D4A79
alias: spi:mcp251xfd
alias: spi:mcp251863
alias: spi:mcp2518fd
alias: spi:mcp2517fd
alias: of:N*T*Cmicrochip,mcp251xfdC*
alias: of:N*T*Cmicrochip,mcp251xfd
alias: of:N*T*Cmicrochip,mcp251863C*
alias: of:N*T*Cmicrochip,mcp251863
alias: of:N*T*Cmicrochip,mcp2518fdC*
alias: of:N*T*Cmicrochip,mcp2518fd
alias: of:N*T*Cmicrochip,mcp2517fdC*
alias: of:N*T*Cmicrochip,mcp2517fd
depends: can-dev
intree: Y
name: mcp251xfd
vermagic: 6.6.74+rpt-rpi-2712 SMP preempt mod_unload modversions aarch64
Logs
dmesg | grep can:
[ 3.165808] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -PLL
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz
m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.170311] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN
+CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz
rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.179808] mcp251xfd spi1.1 can2: MCP2518FD rev0.0 (-RX_INT -PLL
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz
m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 3.186278] mcp251xfd spi1.0 can3: MCP2518FD rev0.0 (-RX_INT -PLL
-MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz
m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 112.788097] can: controller area network core
[ 112.795770] can: raw protocol
[ 2339.305706] mcp251xfd spi0.1 can0: IRQ handler mcp251xfd_handle_tefif() returned -22.
[ 2339.305714] mcp251xfd spi0.1 can0: IRQ handler returned -22 (intf=0x3f1a0014).
[ 4809.135009] mcp251xfd spi0.0 can1: IRQ handler mcp251xfd_handle_tefif() returned -22.
[ 4809.135017] mcp251xfd spi0.0 can1: IRQ handler returned -22 (intf=0x3f1a0014).
Additional context
@marckleinebudde
Thanks in advance
Attached is a code snippet that seems to replicate the bug (in my setup). The longer it runs, the more overflows occur, although I'm not 100% sure, as it's difficult to gather statistics.
import can
import time
import random
from can.interfaces.socketcan import SocketcanBus
import os
period_choices = [0.01, 0.05, 0.1, 0.5, 1]
class CANHandler:
def __init__(self, interface1='can0', interface2='can3'):
self.interface1 = interface1
self.interface2 = interface2
self.messageInfo = {self.interface1: {}, self.interface2: {}}
def setup_can_interface(self, interface):
os.system(f"sudo ip link set down {interface}")
time.sleep(0.5)
os.system(f"sudo ip link set {interface} up type can bitrate 500000 sample-point 0.75 dbitrate 2000000 dsample-point 0.75 restart-ms 1000 berr-reporting on fd on")
time.sleep(0.5)
os.system(f"sudo ip link set {interface} txqueuelen 65536")
time.sleep(0.5)
def send_can_messages(self):
self.setup_can_interface(self.interface1)
self.setup_can_interface(self.interface2)
self.bus0 = can.interfaces.socketcan.SocketcanBus(channel=self.interface1, fd=True)
self.bus1 = can.interfaces.socketcan.SocketcanBus(channel=self.interface2, fd=True)
for i in range(1, 120):
_dlc = random.randint(8, 32)
_id = i
data = [random.randint(0, 255) for _ in range(_dlc)]
period = random.choice(period_choices)
msg_class = can.Message(arbitration_id=_id, data=bytearray(data), is_fd=True, dlc=_dlc, is_extended_id=False, bitrate_switch=True)
self.messageInfo[self.interface1][i] = {
'MSG_CLASS': msg_class,
'PERIOD': period,
}
self.messageInfo[self.interface2][i] = {
'MSG_CLASS': msg_class,
'PERIOD': period,
}
self.messageInfo[self.interface1][i]['THREAD'] = self.bus0.send_periodic(
self.messageInfo[self.interface1][i]['MSG_CLASS'],
self.messageInfo[self.interface1][i]['PERIOD'],
autostart=True
)
time.sleep(0.01)
self.messageInfo[self.interface2][i]['THREAD'] = self.bus1.send_periodic(
self.messageInfo[self.interface2][i]['MSG_CLASS'],
self.messageInfo[self.interface2][i]['PERIOD'],
autostart=True
)
time.sleep(1000)
self.bus0.shutdown()
self.bus1.shutdown()
if __name__ == "__main__":
can_handler = CANHandler()
can_handler.send_can_messages()
Can you please copy https://github.com/linux-can/can-utils/tree/master/mcp251xfd/99-devcoredump.rules to /etc/udev/rules.d and https://github.com/linux-can/can-utils/blob/master/mcp251xfd/devcoredump to /usr/sbin
If the IRQ handler fails with an error, the driver will generate a dump of the driver and chip state and write it to /var/log/devcoredump-*.dump. Please send me this file.
i also discovered this problem. I'm on the latest available kernel
Linux robomaster 6.13.0-v8-16k+ #1 SMP PREEMPT_RT Sat Feb 1 22:01:29 CET 2025 aarch64 GNU/Linux
it dont happens every time, it randomly occurs. also the driver feels a bit unstable currently. in the most of the time I randomly face some "timeouts" where im unable to send or read. but not every time I get IRQ handler fails.
Can you please copy https://github.com/linux-can/can-utils/tree/master/mcp251xfd/99-devcoredump.rules to
/etc/udev/rules.dand https://github.com/linux-can/can-utils/blob/master/mcp251xfd/devcoredump to/usr/sbinIf the IRQ handler fails with an error, the driver will generate a dump of the driver and chip state and write it to
/var/log/devcoredump-*.dump. Please send me this file.
Did it without success, no file has been generated, but the IRQ handler message is present in dmesg | grep can
filogold@raspberrypi:/etc/udev/rules.d $ ls
99-com.rules 99-devcoredump.rules
filogold@raspberrypi:/usr/sbin $ ls
aa-remove-unknown grpck pppoe-discovery
aa-status grpconv pppstats
aa-teardown grpunconv pwck
accessdb halt pwconv
addgnupghome hwclock pwhistory_helper
addgroup i2cdetect pwunconv
add-shell i2cdump rarp
adduser i2cget readprofile
agetty i2cset reboot
alsabat-test i2c-stub-from-dump remove-default-ispell
alsactl i2ctransfer remove-default-wordlist
alsa-info iconvconfig remove-shell
apparmor_parser ifconfig request-key
apparmor_status iio-sensor-proxy resize2fs
applygnupgdefaults init resolvconf
arp insmod rfkill
arpd installkernel rmmod
aspell-autobuildhash install-sgmlcatalog rmt
avahi-daemon invoke-rc.d rmt-tar
badblocks ip route
blkdeactivate ipmaddr rpcbind
blkdiscard ippevepcl rpcctl
blkid ippeveprinter rpcdebug
blkmapd ippeveps rpc.gssd
blkzone ipp-usb rpc.idmapd
blockdev iptunnel rpcinfo
bluetoothd isosize rpc.statd
bridge ispell-autobuildhash rpc.svcgssd
capsh iw rtacct
cfdisk iwconfig rtcwake
cgdisk iwevent rtkitctl
chat iwgetid rtmon
chcpu iwlist runlevel
chgpasswd iwpriv runuser
chmem iwspy saned
chpasswd kbdrate select-default-ispell
chroot key.dns_resolver select-default-wordlist
cifs.idmap killall5 service
cifs.upcall ldattach setcap
cpgr ldconfig setvesablank
cppw lightdm sfdisk
cron lightdm-gtk-greeter sgdisk
ctrlaltdel locale-gen shadowconfig
cupsaccept logrotate showmount
cups-browsed logsave shutdown
cupsctl losetup slattach
cupsd lpadmin sm-notify
cupsdisable lpinfo sshd
cupsenable lpmove start-statd
cupsfilter lsmod start-stop-daemon
cupsreject make-ssl-cert sudo_logsrvd
dcb mii-tool sudo_sendlog
debugfs mkdosfs sulogin
delgroup mke2fs swaplabel
deluser mkfs swapoff
depmod mkfs.bfs swapon
devcoredump mkfs.cramfs switch_root
devlink mkfs.exfat sysctl
dhclient mkfs.ext2 tarcat
dhclient-script mkfs.ext3 tc
dmidecode mkfs.ext4 telinit
dmsetup mkfs.fat th-cmd
dmstats mkfs.minix thd
dnsmasq mkfs.msdos tipc
dosfsck mkfs.ntfs tune2fs
dosfslabel mkfs.vfat tune.exfat
dphys-swapfile mkhomedir_helper umount.nfs
dpkg-preconfigure mkinitramfs umount.nfs4
dpkg-reconfigure mklost+found umount.udisks2
dumpe2fs mkntfs unix_chkpwd
dump.exfat mkswap unix_update
e2freefrag ModemManager update-ca-certificates
e2fsck modinfo update-catalog
e2image modprobe update-default-aspell
e2label mount.cifs update-default-ispell
e2mmpstatus mount.fuse update-default-wordlist
e2scrub mount.fuse3 update-dictcommon-aspell
e2scrub_all mount.lowntfs-3g update-dictcommon-hunspell
e2undo mount.nfs update-fonts-alias
e4crypt mount.nfs4 update-fonts-dir
e4defrag mount.ntfs update-fonts-scale
ethtool mount.ntfs-3g update-gsfontmap
exfat2img mount.smb3 update-icon-caches
exfatlabel mountstats update-inetd
faillock nameif update-initramfs
fake-hwclock NetworkManager update-java-alternatives
fatlabel newusers update-locale
fdisk nfsconf update-mime
filefrag nfsidmap update-passwd
findfs nfsiostat update-pciids
fixparts nfsstat update-rc.d
flashrom nft update-shells
fsck nologin update-xmlcatalog
fsck.cramfs ntfsclone usb_modeswitch
fsck.exfat ntfscp usb_modeswitch_dispatcher
fsck.ext2 ntfslabel useradd
fsck.ext3 ntfsresize userdel
fsck.ext4 ntfsundelete usermod
fsck.fat pam-auth-update v4l2-dbg
fsck.minix pam_getenv validlocale
fsck.msdos pam_namespace_helper vcstime
fsck.vfat pam_timestamp_check vdpa
fsfreeze paperconfig vigr
fstab-decode parted vipw
fstrim partprobe visudo
gdisk pi-greeter wayvnc-control.py
genl pivot_root wayvnc-generate-keys.sh
getcap pkla-admin-identities wayvnc-run.sh
getpcaps pkla-check-authorization wipefs
getty plipconfig wpa_action
gnome-menus-blacklist plymouthd wpa_cli
groupadd plymouth-set-default-theme wpa_supplicant
groupdel poweroff zic
groupmems pppd zramctl
groupmod pppdump
filogold@raspberrypi:/usr/sbin $ ifconfig
can1: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536 (UNSPEC)
RX packets 6682736 bytes 79431951 (75.7 MiB)
RX errors 42 dropped 0 overruns 0 frame 0
TX packets 6682778 bytes 79432515 (75.7 MiB)
TX errors 0 dropped 0 overruns 96 carrier 0 collisions 0
device interrupt 187
can2: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536 (UNSPEC)
RX packets 6906321 bytes 81185955 (77.4 MiB)
RX errors 6 dropped 0 overruns 0 frame 0
TX packets 6906327 bytes 81186051 (77.4 MiB)
TX errors 0 dropped 0 overruns 9529 carrier 0 collisions 0
device interrupt 188
can3: flags=193<UP,RUNNING,NOARP> mtu 72
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 65536 (UNSPEC)
RX packets 6733854 bytes 79508518 (75.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6733854 bytes 79508518 (75.8 MiB)
TX errors 0 dropped 0 overruns 9718 carrier 0 collisions 0
device interrupt 189
filogold@raspberrypi:~ $ dmesg | grep can
[ 2.765163] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 2.771719] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 2.778191] mcp251xfd spi1.1 can2: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 2.783010] mcp251xfd spi1.0 can3: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[ 138.971928] can: controller area network core
[ 138.979074] can: raw protocol
[ 2299.725953] mcp251xfd spi1.0 can3: IRQ handler mcp251xfd_handle_tefif() returned -22.
[ 2299.725964] mcp251xfd spi1.0 can3: IRQ handler returned -22 (intf=0xbf1a0014).
Did it without success, no file has been generated, but the IRQ handler message is present in dmesg | grep can
Did you make devcoredump executable?
chmod +x /usr/sbin/devcoredump
...and you have to restart the system, after you initially copied 99-devcoredump.rules.
...and you have to restart the system, after you have to restart the system after you initially copied
99-devcoredump.rules.
Sorry guys, my bad. I still need to learn how to properly handle Linux.
Upload the file here @marckleinebudde
You can upload files here, too. For reference: devcoredump-20250203-234629.dump.gz
i also discovered this problem. I'm on the latest available kernel
Linux robomaster 6.13.0-v8-16k+ #1 SMP PREEMPT_RT Sat Feb 1 22:01:29 CET 2025 aarch64 GNU/Linuxit dont happens every time, it randomly occurs. also the driver feels a bit unstable currently. in the most of the time I randomly face some "timeouts" where im unable to send or read. but not every time I get IRQ handler fails.
Can you explain your HW setup. Which raspi? Which mcp2518fd connected to which SPI and CS?
Can you explain you HW setup. Which raspi? Which mcp2518fd connected to which SPI and CS?
Yes sure. Im currently using a raspberry pi 5 with PiCAN FD HAT. Its configured like described in their instructions
dtparam=spi=on
dtoverlay=mcp251xfd,spi0-0,interrupt=25
I'm currently only using classic CAN and not FD with 1Mhz and about 40% of load on the bus. As mentioned i not always face the IRQ issue. But sometimes short dropouts and increasing overruns (TX and RX) the longer it runs. Sometimes during dropouts I see crazy amount of overruns.
@Vinz1911, do you have something else attached to the SPI?
No, i just have this HAT attached. Additionally i use the onboard bluetooth and i have a display over i2c (small oled 128x32). No USB devices or something else.
You can rule out a class of Pi 5-specific problems by installing the patch in pull request #6646: sudo rpi-update pulls/6646
Back up any important data first.
unfortunately did not solve my problem.
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 2500 (UNSPEC)
RX packets 309358 bytes 2194456 (2.0 MiB)
RX errors 2 dropped 5372 overruns 27 frame 0
TX packets 58978 bytes 398433 (389.0 KiB)
TX errors 0 dropped 0 overruns 2 carrier 0 collisions 0
device interrupt 186
i also discovered this happens or is much more reproducible, after a fresh start of the raspberry pi. then my program runs like a half minute then I get a lot of overruns and im unable to read or write on the CANBus. restarting the program then works, I would say, as expected but still slowly increasing the overrun count for RX and TX. But since your patch, it seems, the overruns increase are much less
That's a shame - I'll just observe while Marc gets on with it.
I did some additional testing today and found a way to reliable reproduce the issue that appears for me. My program reads and writes on the CANBus and ive configured in socketCAN a read timeout of 100ms and my program stops after 3 failed read attempts.
This Hiccup, when the Pi is unable to interact with the CANBus, only appears after a cold boot (also warm boot) (pi without any power source connected). When the Pi has finished booting i enable the interface with:
sudo ip link set can0 up type can bitrate 1000000 restart-ms 1000 berr-reporting on fd off
sudo ifconfig can0 txqueuelen 2500
it takes around 20 seconds then something happens and the hiccup occurs, then my program exits because its unable to read or write. restarting the program then works and runs without "any" issues (overruns count increase slightly but seems not to affect my program). When I do a reboot also everything works fine and the issue does not appear again. Except I do a cold/warm boot. removing the power source so the Pi is completely turned off and turn it on again, then the hiccup appears once again and then it works like described.
After the hiccup it looks mostly like this (overruns and error count varies a lot):
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 2500 (UNSPEC)
RX packets 151705 bytes 1076459 (1.0 MiB)
RX errors 20 dropped 5989 overruns 0 frame 0
TX packets 33412 bytes 226254 (220.9 KiB)
TX errors 0 dropped 0 overruns 20 carrier 0 collisions 0
device interrupt 171
I dont see any issues when using dmesg. Sometimes the interface becomes completely unreachable by also having the IRQ issue. But this is not reliable reproducible by me. Maybe there is not enough load on the CANBus to trigger that issue.
After a reboot and without the hiccup and the latest patch it mostly look like this:
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 2500 (UNSPEC)
RX packets 1679156 bytes 11911832 (11.3 MiB)
RX errors 0 dropped 7049 overruns 0 frame 0
TX packets 306671 bytes 2066722 (1.9 MiB)
TX errors 0 dropped 0 overruns 1 carrier 0 collisions 0
device interrupt 186
It seems something strange happens after enabling the interface for the first time after a cold/warm boot which leads in this hiccup while the pi is interacting with the CAN HAT.
EDIT: i forgot to mention that the CANBus is not restarted. I also attached an additional raspberry pi with a Waveshare CAN HAT (MCP2515) to check if something weird happens or the entire Bus gets stuck during the hiccup appears, but this is not the case.
You can rule out a class of Pi 5-specific problems by installing the patch in pull request #6646:
sudo rpi-update pulls/6646Back up any important data first.
Tried with this Kernel version, still same behavior.
Good to know - it was worth a try.
The community and me don't see these kind of errors on non Raspi-5 systems, even the Raspi-4 is good. Maybe the SPI host driver is more optimized and triggers a race condition in the mcp251xfd driver. I'm on holidays next week and I think I don't find time to look into this issue before.
Timing will certainly be different. I doubt it will be more optimised though - the Synopsys had several missing features, some of which I've added, and has received much less attention than the highly polished bcm2835 equivalent. I also wouldn't rule out a real bug.
Update @marckleinebudde: After speaking with the seller, they suggested downgrading the kernel to version 6.6.42 (released on July 25th) with the commit hash d29744869d42eff651176274402cce7c69932f52.
I'm currently running some tests, and so far, the overrun messages seem to have disappeared.
I believe it's best to keep the discussion open since the issue is not resolved in the newer version.
Thanks for the hint @filogold i will also try this Kernel Version and check if this resolves my problem as well.
@filogold Where can I find d29744869d42eff651176274402cce7c69932f52?
@filogold Where can I find d29744869d42eff651176274402cce7c69932f52?
https://github.com/raspberrypi/firmware/commit/d29744869d42eff651176274402cce7c69932f52
EDIT: idk why but you have to copy and paste the link on a new tab in order to see the webpage
I simply installed using the command:
rpi-update d29744869d42eff651176274402cce7c69932f52
can you check again @filogold, something seems odd. I tried it but seems the hash does not exist.
pi@raspberry:~$ sudo rpi-update d29744869d42eff651176274402cce7c69932f52
*** Raspberry Pi firmware updater by Hexxeh, enhanced by AndrewS and Dom
*** Performing self-update
*** Relaunching after update
*** Raspberry Pi firmware updater by Hexxeh, enhanced by AndrewS and Dom
FW_REV:
BOOTLOADER_REV:8eef29aed99de9af74e79542309d6106d16abd12
*** Invalid hash given
rpi-update takes hashes from the rpi-firmware repo, not from firmware.
https://github.com/raspberrypi/firmware/commit/d29744869d42eff651176274402cce7c69932f52 would be https://github.com/raspberrypi/rpi-firmware/commit/1969fd7d6748f4977b1802c2b7caa7b1d3a334fb
sudo rpi-update 1969fd7d6 should do the trick.
@marckleinebudde I did also some testing now. The "weird" hiccup, I described a few posts ago, did not disappear for me but the overruns did. Also here no increasing overrun count for RX/TX anymore with 6.6.42 - 1969fd7d6 .
... would be raspberrypi/rpi-firmware@1969fd7
The git commit hash in raspberrypi/linux tree is bfbd468e81defa70d4356a1a3e467cc5d67ca89b (I got it from git_hash file from the rpi-firmware repo link above)
https://github.com/raspberrypi/linux/tree/bfbd468e81defa70d4356a1a3e467cc5d67ca89b
That's the commit before erratum DS80000789E 6 is addressed.
any updates on this? @marckleinebudde