linux-rpi
linux-rpi copied to clipboard
mcp25xxfd reports errors in the kernel log
Hi,
I'm evaluating the PICAN FD Duo board from Skpang. To test it out, I've connected can0 to can1 and running cangen -g 0 -m can0. After a short period of time, the command will exit with a message write: No buffer space available.
Kernel:
root@signalserver-canfd:~# uname -a Linux signalserver-canfd 4.14.98ms7-v7+ #75 SMP Sat Feb 16 22:22:33 UTC 2019 armv7l GNU/Linux
Looking in the kernel log I can see the following messages:
[ 323.947188] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 323.963571] IPv6: ADDRCONF(NETDEV_CHANGE): can1: link becomes ready
[ 343.344348] can: controller area network core (rev 20170425 abi 9)
[ 343.344444] NET: Registered protocol family 29
[ 343.352870] can: raw protocol (rev 20170425)
[ 343.361683] NOHZ: local_softirq_pending 08
[ 343.363120] NOHZ: local_softirq_pending 08
[ 343.366264] NOHZ: local_softirq_pending 08
[ 343.367370] NOHZ: local_softirq_pending 08
[ 343.368599] NOHZ: local_softirq_pending 08
[ 343.369537] NOHZ: local_softirq_pending 08
[ 343.370786] NOHZ: local_softirq_pending 08
[ 343.374552] NOHZ: local_softirq_pending 08
[ 343.375805] NOHZ: local_softirq_pending 08
[ 343.376777] NOHZ: local_softirq_pending 08
[ 402.700957] mcp25xxfd spi0.1 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
[ 402.701055] mcp25xxfd spi0.1 can0: tefif: fifo 1 not pending - tef data: id: 0b066825 flags: 00000217, ts: 04947926 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 402.702035] mcp25xxfd spi0.1 can0: tefif: fifo 2 not pending - tef data: id: 03c34c04 flags: 00000435, ts: 04947d64 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 402.703176] mcp25xxfd spi0.1 can0: tefif: fifo 3 not pending - tef data: id: 00000479 flags: 0000068b, ts: 04948196 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 402.703214] mcp25xxfd spi0.1 can0: tefif: fifo 4 not pending - tef data: id: 14d04342 flags: 0000089c, ts: 049485bf - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 402.704475] mcp25xxfd spi0.1 can0: tefif: fifo 5 not pending - tef data: id: 00000253 flags: 00000ac0, ts: 049489d3 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 402.704512] mcp25xxfd spi0.1 can0: tefif: fifo 6 not pending - tef data: id: 177f51db flags: 00000c9e, ts: 04948e1a - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 402.704546] mcp25xxfd spi0.1 can0: tefif: fifo 7 not pending - tef data: id: 041ee9fb flags: 00000e15, ts: 04949245 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
I've tried different values for -g, but it just takes slightly longer to fail. Also I've tried to half spimaxfrequency following the suggestion in the kernel log.
this may be due to some bugs - please try the latest version I have just published
No, this is still happening:
[ 156.474627] can: controller area network core (rev 20170425 abi 9)
[ 156.474726] NET: Registered protocol family 29
[ 156.484418] can: raw protocol (rev 20170425)
[ 156.488317] NOHZ: local_softirq_pending 08
[ 156.488422] NOHZ: local_softirq_pending 08
[ 156.491721] NOHZ: local_softirq_pending 08
[ 156.491800] NOHZ: local_softirq_pending 08
[ 156.491874] NOHZ: local_softirq_pending 08
[ 156.491919] NOHZ: local_softirq_pending 08
[ 156.492017] NOHZ: local_softirq_pending 08
[ 156.492044] NOHZ: local_softirq_pending 08
[ 156.492169] NOHZ: local_softirq_pending 08
[ 156.492294] NOHZ: local_softirq_pending 08
[ 169.584554] mcp25xxfd spi0.1 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
[ 169.584676] mcp25xxfd spi0.1 can0: tefif: fifo 6 not pending - tef data: id: 081f6bd7 flags: 00000c18, ts: 04485841 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 169.585603] mcp25xxfd spi0.1 can0: tefif: fifo 1 not pending - tef data: id: 10d8d200 flags: 000002dd, ts: 04485c89 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 248.568010] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 248.568151] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 248.568219] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 248.570112] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 248.570152] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[1]+ Killed cangen -g 1 -m can0
root@signalserver-canfd:~# uname -a
Linux signalserver-canfd 4.19.40ms7-v7+ #81 SMP Sat May 11 17:33:45 UTC 2019 armv7l GNU/Linux
I'd be happy to help out debugging or providing more information to help you solve this issue.
I would need a bit more details!
Things like: spi clock speed configured Description of test setup Description of test - how are you testing? Description of can bus parameters (baud,...) Output of head /sys/kernel/debug/mcp25xxfd*/spi* Output of ip -details link show can0 Output of uname -a Version of code - is it really 8.2?
Other things that may be of interest - some that i can not think of right now
Looking at it some more I came to realize:
You seem to be using:
- a Raspberry Pi 3*
- using spi 0.1 - so i guess that that you use spi-bcm2835 with (at least) 2 chip selects
It leaves open some other questions...
Configuration file: config.txt
File I use to set up interfaces: ifup-canfd.txt
Output from /sys/kernel/debug/mcp25xxfd*/spi* as you requested:
root@signalserver-canfd:~# find /sys/kernel/debug/mcp25xxfd-spi0.0/ -name "spi_*" -exec echo "{}:" \; -exec cat "{}" \;
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_crc_read_split:
0
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_crc_read:
1
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_use_speed_hz:
20000000
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_normal_speed_hz:
20000000
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_setup_speed_hz:
10000000
root@signalserver-canfd:~# find /sys/kernel/debug/mcp25xxfd-spi0.1/ -name "spi_*" -exec echo "{}:" \; -exec cat "{}" \;
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_crc_read_split:
0
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_crc_read:
1
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_use_speed_hz:
20000000
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_normal_speed_hz:
20000000
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_setup_speed_hz:
10000000
Output from ip -details as you requested:
root@signalserver-canfd:~# ip -details link show type can
3: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/can promiscuity 0
can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
bitrate 500000 sample-point 0.800
tq 25 prop-seg 31 phase-seg1 32 phase-seg2 16 sjw 1
mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
dbitrate 2000000 dsample-point 0.800
dtq 25 dprop-seg 7 dphase-seg1 8 dphase-seg2 4 dsjw 1
mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
clock 40000000numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
4: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/can promiscuity 0
can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
bitrate 500000 sample-point 0.800
tq 25 prop-seg 31 phase-seg1 32 phase-seg2 16 sjw 1
mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
dbitrate 2000000 dsample-point 0.800
dtq 25 dprop-seg 7 dphase-seg1 8 dphase-seg2 4 dsjw 1
mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
clock 40000000numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Output from uname -a:
root@signalserver-canfd:~# uname -a
Linux signalserver-canfd 4.19.40ms7-v7+ #81 SMP Sat May 11 17:33:45 UTC 2019 armv7l GNU/Linux
Not sure how to show you that I really run your software. I pulled the kernel directly from your release page. :)
Image of test setup:

Reproducing the error:
cangen -g 1 -m can0 & candump can1
Tell me if you need anything else.
From config.txt:
dtparam=i2c_arm=on
dtparam=i2c_vc=on
dtparam=spi=on
dtoverlay=mcp2517fd-can0,oscillator=40000000,interrupt=25
dtoverlay=mcp2517fd-can1,oscillator=40000000,interrupt=05
That can1 is tied to interrupt 05 seems a little bit odd, but it's according to instructions from SkPang. On the previous version of the PICAN DUO shield the interrupts for the can interfaces were in sequence.
Please let me know if you're able to reproduce the error or if I could assist in isolating the bug further.
I was having the same issue and as a workaround I reverted those two commits that optimize TEF reading https://github.com/msperl/linux-rpi/commit/d86b4846d307e958ad3b559644654cad9316c3c5#diff-8a495b05736235e49b1a602651e1a187 https://github.com/msperl/linux-rpi/commit/65c507a82245721c5473c1875e3bfec1a36b29ec#diff-8a495b05736235e49b1a602651e1a187
The problems (hopefully) seems to be gone now without those two commits, however I didn't dig into it much further.
Thanks @jlanger, I will try to do the same when I figure out how to build the kernel.
I recompiled the kernel from commit cf733ff8b7a570f640169ab9fb577c65d4e7181f just before the TEF optimizations as just reverting the commits resulted in some compile errors that I didn't care to investigate.
It looks promising. I'm able to run cangen -g 1 -m can0 & candump can1 in my setup without the error reoccurring.
...after a while I get this message though:
write: No buffer space available
But it could be just a matter of increasing txqueuelen.
Edit: When running the code from this commit, with increased txqueuelen I get a kernel panic. I can post it on request, but because it's not the latest there might be a fix in a later commit.
I also did some more testing and it specifically seems to be this commit https://github.com/msperl/linux-rpi/commit/65c507a82245721c5473c1875e3bfec1a36b29ec#diff-8a495b05736235e49b1a602651e1a187 That's making trouble. The first TEF optimization commit seems to work fine.
@enmasse: Yes if you revert that commit you need to re-add some variable definitions in mcp25xxfd_can_debugfs.c to make it build again.
@jlanger, it depends on what you mean when you say first or second commit, they're dated in the opposite order of when they're commited. :)
I reverted only commit 65c507a82245721c5473c1875e3bfec1a36b29ec but then I still get the error.
Even with the two TEF commits reverted I was able to reproduce the kernel panic.
Could this be related to the issue described in the MCP2517FD Errata?
TX MAB underflow due to long delays between SPI bytes In case of a TX MAB underflow, the device willnotify the application by setting SERRIF and MODIF and by transitioning to Restricted Opera-tion or Listen Only mode (depending on CiCON.SERR2LOM
We tested with a board using the MCP2518FD where the errata of the MCP2517 have been fixed. This did not solve the issue with the "Something went wrong - we got a TEF..." error.
I also have same issue on mcp2517fd and rpi3. @msperl do you have any update to solve this issue.
Even with the two TEF commits reverted I was able to reproduce the kernel panic.
do you have already solve this issue? can frame stop and need reboot
Hi,
I'm evaluating the PICAN FD Duo board from Skpang. To test it out, I've connected can0 to can1 and running
cangen -g 0 -m can0. After a short period of time, the command will exit with a messagewrite: No buffer space available.Kernel:
root@signalserver-canfd:~# uname -a Linux signalserver-canfd 4.14.98ms7-v7+ #75 SMP Sat Feb 16 22:22:33 UTC 2019 armv7l GNU/LinuxLooking in the kernel log I can see the following messages:
[ 323.947188] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready [ 323.963571] IPv6: ADDRCONF(NETDEV_CHANGE): can1: link becomes ready [ 343.344348] can: controller area network core (rev 20170425 abi 9) [ 343.344444] NET: Registered protocol family 29 [ 343.352870] can: raw protocol (rev 20170425) [ 343.361683] NOHZ: local_softirq_pending 08 [ 343.363120] NOHZ: local_softirq_pending 08 [ 343.366264] NOHZ: local_softirq_pending 08 [ 343.367370] NOHZ: local_softirq_pending 08 [ 343.368599] NOHZ: local_softirq_pending 08 [ 343.369537] NOHZ: local_softirq_pending 08 [ 343.370786] NOHZ: local_softirq_pending 08 [ 343.374552] NOHZ: local_softirq_pending 08 [ 343.375805] NOHZ: local_softirq_pending 08 [ 343.376777] NOHZ: local_softirq_pending 08 [ 402.700957] mcp25xxfd spi0.1 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo [ 402.701055] mcp25xxfd spi0.1 can0: tefif: fifo 1 not pending - tef data: id: 0b066825 flags: 00000217, ts: 04947926 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced [ 402.702035] mcp25xxfd spi0.1 can0: tefif: fifo 2 not pending - tef data: id: 03c34c04 flags: 00000435, ts: 04947d64 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced [ 402.703176] mcp25xxfd spi0.1 can0: tefif: fifo 3 not pending - tef data: id: 00000479 flags: 0000068b, ts: 04948196 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced [ 402.703214] mcp25xxfd spi0.1 can0: tefif: fifo 4 not pending - tef data: id: 14d04342 flags: 0000089c, ts: 049485bf - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced [ 402.704475] mcp25xxfd spi0.1 can0: tefif: fifo 5 not pending - tef data: id: 00000253 flags: 00000ac0, ts: 049489d3 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced [ 402.704512] mcp25xxfd spi0.1 can0: tefif: fifo 6 not pending - tef data: id: 177f51db flags: 00000c9e, ts: 04948e1a - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced [ 402.704546] mcp25xxfd spi0.1 can0: tefif: fifo 7 not pending - tef data: id: 041ee9fb flags: 00000e15, ts: 04949245 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproducedI've tried different values for
-g, but it just takes slightly longer to fail. Also I've tried to halfspimaxfrequencyfollowing the suggestion in the kernel log.
I also have same issue with you. do you have already solve this issue?
We tested with a board using the MCP2518FD where the errata of the MCP2517 have been fixed. This did not solve the issue with the "Something went wrong - we got a TEF..." error.
@johandc Do you have to restart the interface or your system in this case?
@Skypodo yes. If dn and if up again.
@johandc Did it improve than in any way? Did you notice any differences in behaviour between MCP2517FD and MCP2518FD? I noticed that reducing the SPI frequency (i.e. 2.5 MHz) or increasing the message interval (~1k messages per second) increases the error rate.
@Skypodo
I don't know, it feels very random to me when the error hits, so its subjective. But i felt like there was very little difference between the two chips.
There are two issues, as far as i can see.
- That the error occurs and
- That the driver does not try to recover automatically.
If the driver would receover itself, i could accept to loose a can frame once in a while. But having it locking up, is the worst possible scenario.
@johandc That has been also my experience for now. Once the driver is stuck, there is no way it recovers and you have to turn the interface off and on again. Over the holidays I've done some long-term tests to see how it might behave in our particular use case. SPI was running at 10 MHz and both of the CAN nodes sent random frames each 10/10.07 ms. This was running for 16 days without major errors. I had some bus errors occasionally but they were very rare and the driver managed to keep going. For my use case this is sufficient for now.
I'm not sure if it's related to this particular issue, but I ran into a problem with the driver not recovering from a BUS_OFF state. The mcp2518fd cancels all pending transmissions in this state, but the driver thinks they're still pending. This leads to a lockup since as far as the driver's concerned there's no free TX queues.
The driver needs to cancel its pending transmissions when a BUS_OFF error is detected (and report them as failed to userspace). The problem is that sometimes there's weird race conditions between detecting the BUS_OFF state change and queuing a new frame for TX asynchronously, so it's really difficult to figure out which frames are still queued or not for TX. An occasional check for lockup (if no frame TX complete interrupts have occurred in X seconds) can somewhat work around that issue.
As we are facing the same problems I've investigated further in the issue with not pending FIFOs and the problem is located in the optimized interrupt handler mcp25xxfd_can_tx_handle_int_tefif_optimized().
If you remove the handling and just use mcp25xxfd_can_tx_handle_int_tefif_conservative() in mcp25xxfd_can_tx_handle_int_tefif() the classical (and more SPI bandwith consuming) handling is used and the driver should work stable.
The problem with the optimized version seems to be a timing issue. Now and then - with our application even after hours of running w/o problems - more than one TEF is read in advance, meaning not only the signalled CiTEFUA (that's not even used in the driver I think) is read, but also consecutive TEFs are readout. This works fine most of the times, but anytime after the read a TEFINT is triggered that leads to a readout of not ready TEFs. We believe that this is due to some readout while not all TEFs have been written by the chip completely, so the TEFNEIF flag is set after the readout itself has already taken place.
Once in this situation, every further send of CAN(-FD) messages lead to a readout of a wrong TEF, a missing interrupt and - in the end - no idle TX-FIFOs, which in turn make the driver stop sending. ifdown and ifup lead to re-initialization, so the interface is working again.
I've created a support case with microchip to clarify if a readout other than at the CiTEFUA address for the TEF is safe according to concurrent updates, as for the moment we cannot prove this (mcp25xxfd datasheet and reference manual is not clear at this point). I'll keep this updated if I get any news.
That is excellent news. At least there is a workaround for now. Thank very much for looking into this.
I did another test and was able to confirm via the TEF timestamp that the not-pending FIFO TEFs are definitely the ones from the last send cycle of the 6 FIFOs. This proves the out-of-sync between the driver handling of the TEFs ready to be read and the chip itself providing the information about the ready TEFs with CiTEFUA, TEFNEIF flag and the corresponding TEFINT.
So for now, the safest option is to stick with mcp25xxfd_can_tx_handle_int_tefif_conservative(). Maybe it's useful to check the above mentioned flags/address to make the handling more robust (while of course there's more SPI communication) - especially if there will occur strange behavior with the conservative handling.
Also we've got the idea making use of multiple message objects per TX fifo as described in the reference manual, which in turn will lead to a different handling regarding the TEFs. While we are currently tight on schedule in our project, there may be a contribution from our side in the future.
We can also confirm this appears to solve the issue. We did however also notice significant performance improvements after switching from the MCP2517FD to the MCP2518FD, so if anyone has issues it might be worth to take out your hot air gun.
@heikoschmidttki how did you test your solution? May you also describe what your setup looks like (SPI frequency, SYSCLK, timings)? When using your patch I still get some errors like while generating CAN FD frames with cangen -g 0 -i -x -f can0
tefif: fifo %i not pending - tef data: id: %08x flags: %08x, ts: %08x - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproducedfound IVMIF situation not supported by driver - bdiag = [0x0000fda0, 0x40009440]CAN Bus error experienced
Once there are enough bus errors, the bus turns off and restarting the interface, the driver is unable to send any more messages.