signalk-server CAN0 failure with RPI4b and Pican-M hat caused by signalk-rpi-monitor at low intervals

Previously reported that CAN0 randomly stops streaming data (via canboatjs) from a Pican-M hat on a RPI4b connected to a N2k network. [#1622]

Okay so after a LOT of investigation I've been able to isolate the problem and the bug is (mostly) recreatable.

The problem stems from the signalk-rpi-monitor interval. When the monitor plugin is set to report at an interval of say 5s (the default is 60s) CAN0 is likely to fail after 20min or so. The lower the interval the more likely the failure will occur in short order.

I haven't isolated the root cause of the problem (I'm guessing that it has something to do with nodejs's intermittent garbage collection cycle) but it's a fairly serious problem as the only way to get one's instruments back is to restart the Signalk server process.

I'd bet that even at the default 60s interval the CAN bus will, in time, fail (silently). At sea, this could be problematic to say the least.

To recreate the problem you'll need a CAN0 connection of some kind on an RPI 3/4b (only tested on 4b w/Pican-M) and signalk-rpi-monitor installed and its interval set to a sufficiently low value.

For the time being, boaters with signalk-rpi-monitor installed on a RPI with a CAN0 bus should, at the very least, take warning and definitely not set their rpi-monitor interval to a value less than the default.

As for a longer term solution, further investigation into canboatjs is warranted as the failure is not occurring at the system level (candumpjs/candump continue to function normally after signalk's CAN0 data connection fails).

Sep 17 '23 15:09 naugehyde

Whoa. Thanks for investigating and sharing. Created https://github.com/sberl/signalk-rpi-monitor/issues/9 with link here and posted link to this on Slack.

Sep 17 '23 18:09 tkurki

Maybe we close the other issue and edit the title of this one to mention CAN and signalk-rpi-monitor?

Sep 17 '23 18:09 tkurki

I'll be interested to see if others can recreate the problem on different hardware and different canboat connections. I have a waveshare CAN hat rig that I'll try later today.

In the meantime, setting the @canboat/canboatjs option noDataReceivedTimeout to an absurdly large value (60s) to see if the problem is mitigated.

I suspect the problem is only triggered by signalk-rpi-monitor (which of necessity uses the spawn() function repeatedly) but may be the inevitable result of a garbage collecting program environment.

Sep 17 '23 19:09 naugehyde

One stopgap solution against any plugin (not just rpi-monitor) that's doing whatever rpi-monitor is doing and causing CAN0 to go down is to set the "noDataReceivedTimeout" suboption for the connection in your settings.json.

e.g.: "pipedProviders": [ { "enabled": true, "id": "can0", "pipeElements": [ { "options": { "logging": true, "subOptions": { "noDataReceivedTimeout":2, "interface": "can0", "type": "canbus-canboatjs", "uniqueNumber": 1159732, "useCanName": true, "filtersEnabled": false, "filters": [], "overrideTimestamp": true }, "type": "NMEA2000" }, "type": "providers/simple" } ] },

Sep 17 '23 23:09 naugehyde

@naugehyde how is the can0 interface configured? In /etc/network/interfaces or /etc/network/interfaces.d? Could you share the configuration?

Sep 18 '23 06:09 mairas

My theory: PICAN-M uses MCP2515 as the CAN controller (just like most other Pi devices out there). That IC has an RX queue of only 2 frames. Signalk-rpi-monitor spawns external processes to read the information. Even though this should be an async operation, I dunno, maybe this still causes some blocking in nodejs main loop. If this is the case, it could make the RX buffers sometimes overflow. I believe overflows cause the RX error counter to increase, and once REC goes above 127, the CAN controller enters an Error-Passive state, causing any transmissions to cease.

Sep 18 '23 07:09 mairas

Thanks, that's interesting.

FYI rpi-monitor spawns system shells and executes system calls (mpstat etc) to get its data. It's possible one or more of the commands is blocking the CAN IP device but I doubt it as CAN0 continues to function normally after canboatjs has hiccupped (a candump open in another terminal will stream continuously). It stands to reason that candump would reflect any blockage and timeout/retry in its stream.

(As I noted earlier, setting a noDataReceivedTimeout value in the CAN0 suboptions causes canboatjs to retry after hiccupping and the problem is minimized as far as SK is concerned.)

But yeah it remains a mystery why canboatjs is hiccupping at all. The nodejs event loop should not be blocked by spawned system calls. If it is the case that the CAN device buffer is overflowing then canboatjs needs to handle it more gracefully. But that still doesn't answer the question of what rpi-monitor is doing to cause the problem in the first place!

On Mon, Sep 18, 2023 at 12:01 AM Matti Airas @.***> wrote:

My theory: PICAN-M uses MCP2515 as the CAN controller (just like most other Pi devices out there). That IC has an RX queue of only 2 frames. Signalk-rpi-monitor spawns external processes to read the information. Even though this should be an async operation, I dunno, maybe this still causes some blocking in nodejs main loop. If this is the case, it could make the RX buffers sometimes overflow. I believe overflows cause the RX error counter to increase, and once REC goes above 127, the CAN controller enters an Error-Passive state, causing any transmissions to cease.

— Reply to this email directly, view it on GitHub https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1722850394, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALTAIRGDNVY6FDKS4FZP2BLX27WTFANCNFSM6AAAAAA43WHRS4 . You are receiving this because you were mentioned.Message ID: @.***>

Sep 18 '23 07:09 naugehyde

@naugehyde could you please provide can0 configuration settings I requested above?

Sep 18 '23 10:09 mairas

Note that IF the issue is due to the RX buffers overflowing, canboatjs has nothing to do with it and it can do nothing about it.

Sep 18 '23 10:09 mairas

The config is the vanilla config prescribed by the manufacturer:

auto can0 iface can0 inet manual pre-up /sbin/ip link set can0 type can bitrate 250000 up /sbin/ifconfig can0 up down /sbin/ifconfig can0 down

NB: I've recreated the issue on a rig with a different mcp2515 can hat ( https://www.waveshare.com/rs485-can-hat.htm).

Not getting reports of buffer overflows from the driver-- as I stated earlier, the HW continues to function normally (if candump/candumpjs are to be believed). Also, it's only canboatjs that's timing out and failing. The event loop continues to stream from other sources after canboatjs stops reporting CAN0 data. And FWIW, canboatjs, when noDataReceivedTimeout is set, will report receiving no data then retry and function normally after a brief delay.

On Mon, Sep 18, 2023 at 3:41 AM Matti Airas @.***> wrote:

Note that IF the issue is due to the RX buffers overflowing, canboatjs has nothing to do with it and it can do nothing about it.

— Reply to this email directly, view it on GitHub https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1723156664, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALTAIRB742EQEH57BJTHTQTX3AQN3ANCNFSM6AAAAAA43WHRS4 . You are receiving this because you were mentioned.Message ID: @.***>

Sep 18 '23 14:09 naugehyde

I recommend adding "restart-ms 100" to the pre-up line. That makes the kernel reset the CAN controller after 100 ms if the controller enters bus-off state. That might not apply in this case because bus-off kills both RX and TX, and you said RX still works. But it's a good in any case.

OK, and finally found:

If you can get TX to stop again (the error state you reported), try the following command (before applying the above change):

ip -details -statistics link show can0

What does it report as state? Normally it should be ERROR-ACTIVE. ERROR-PASSIVE means that the controller is in passive state; TX is disabled but RX still works. And BUS-OFF is completely dead.

With this, we can decisively narrow down the issue; either to the CAN controller & kernel driver or userspace.

Sep 18 '23 14:09 mairas

Excellent! Thank you. Will give this a go ASAP and let you know.

On Mon, Sep 18, 2023 at 7:29 AM Matti Airas @.***> wrote:

I recommend adding "restart-ms 100" to the pre-up line. That makes the kernel reset the CAN controller after 100 ms if the controller enters bus-off state. That might not apply in this case because bus-off kills both RX and TX, and you said RX still works. But it's a good in any case.

OK, and finally found:

If you can get TX to stop again (the error state you reported), try the following command (before applying the above change):

ip -details -statistics link show can0

What does it report as state? Normally it should be ERROR-ACTIVE. ERROR-PASSIVE means that the controller is in passive state; TX is disabled but RX still works. And BUS-OFF is completely dead.

With this, we can decisively narrow down the issue; either to the CAN controller & kernel driver or userspace.

— Reply to this email directly, view it on GitHub https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1723544308, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALTAIRB22QFJU7N2BMV2LILX3BLFDANCNFSM6AAAAAA43WHRS4 . You are receiving this because you were mentioned.Message ID: @.***>

Sep 18 '23 15:09 naugehyde

FYI the bug is easily recreated by setting rpi-monitor's interval to 1s. I'm not near my boat today and can only connect to it over 2G so it may take me a day or so to get to testing. If you have access to a N2K network atop an RPI with a mcp2515 controller, I'd love to know what you discover.

On Mon, Sep 18, 2023 at 8:04 AM Andrew Gerngross @.***> wrote:

Excellent! Thank you. Will give this a go ASAP and let you know.

On Mon, Sep 18, 2023 at 7:29 AM Matti Airas @.***> wrote:

I recommend adding "restart-ms 100" to the pre-up line. That makes the kernel reset the CAN controller after 100 ms if the controller enters bus-off state. That might not apply in this case because bus-off kills both RX and TX, and you said RX still works. But it's a good in any case.

OK, and finally found:

If you can get TX to stop again (the error state you reported), try the following command (before applying the above change):

ip -details -statistics link show can0

What does it report as state? Normally it should be ERROR-ACTIVE. ERROR-PASSIVE means that the controller is in passive state; TX is disabled but RX still works. And BUS-OFF is completely dead.

With this, we can decisively narrow down the issue; either to the CAN controller & kernel driver or userspace.

— Reply to this email directly, view it on GitHub https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1723544308, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALTAIRB22QFJU7N2BMV2LILX3BLFDANCNFSM6AAAAAA43WHRS4 . You are receiving this because you were mentioned.Message ID: @.***>

Sep 18 '23 15:09 naugehyde

Before CAN0 failure:

4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10 link/can promiscuity 0 minmtu 0 maxmtu 0 can state ERROR-ACTIVE restart-ms 0 bitrate 250000 sample-point 0.875 tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1 mcp251x: tseg1 3..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1 clock 8000000 re-started bus-errors arbit-lost error-warn error-pass bus-off 0 0 0 0 0 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped missed mcast 52248708 6531116 0 1233 0 0 TX: bytes packets errors dropped carrier collsns 14640232 1830069 0 0 0 0

After:

4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10 link/can promiscuity 0 minmtu 0 maxmtu 0 can state ERROR-ACTIVE restart-ms 0 bitrate 250000 sample-point 0.875 tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1 mcp251x: tseg1 3..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1 clock 8000000 re-started bus-errors arbit-lost error-warn error-pass bus-off 0 0 0 0 0 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped missed mcast 52811163 6601425 0 1233 0 0 TX: bytes packets errors dropped carrier collsns 14767240 1845950 0 0 0 0

No relevant change that I can see.

Sep 18 '23 16:09 naugehyde

OK, interface is fine both before and after. And it's not about dropped packets because the dropped count didn't change, and error count remains zero. So it is a userspace issue!

Sep 19 '23 15:09 mairas

Yes. Something is stalling the RX at the canboatjs level (socketcan functions normally throughout) and causing a timeout that fails silently. The offending plugin that triggered all of this (rpi-monitor) isn't making any blocking calls that I can see (Node::spawn() for one is documented as non-blocking), so I am mystified as to the root cause.

FWIW other users have reported canboatjs stream failures ( https://github.com/canboat/canboatjs/issues/196 and https://github.com/canboat/canboatjs/issues/193) but I can't speak to the details of their configurations.

My next test is to create a plugin that mimics rpi-monitor spawn'ed system calls and see if there's an offending system call at the root of this or if it is in fact Node::spawn() somehow messing things up.

In the meantime, SK users are advised to add the noDataReceivedTimeout option to canboatjs as I detailed earlier in this thread ( https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1722595149 ).

On Tue, Sep 19, 2023 at 8:07 AM Matti Airas @.***> wrote:

OK, interface is fine both before and after. And it's not about dropped packets because the dropped count didn't change, and error count remains zero. So it is a userspace issue!

— Reply to this email directly, view it on GitHub https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1725835519, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALTAIRG5TIFL7U7EMC2YTSDX3GYKRANCNFSM6AAAAAA43WHRS4 . You are receiving this because you were mentioned.Message ID: @.***>

Sep 19 '23 15:09 naugehyde

@naugehyde 👏 for investigating this! Let's concentrate communications here for the time being, as we dont' know what combination triggers the problem.

I think the best way forward would be to create as low level repro for this as we can, starting with just socketcan and spawn. Could you try something like this?

Sep 22 '23 05:09 tkurki

Thanks for the code. Will put it to use soon.

If it's a resource issue owing to the overhead of frequent shell creation, Nodejs's garbage collection cycle could be running long enough for the CAN0 nodejs stream to programmatically timeout . If this is true (will test inside SK first) running the test outside of SK will change the running memory footprint so I may have to adjust the V8 heap size to invoke the timeout.

Sep 22 '23 15:09 naugehyde

Hey all, possibly off topic, but did want to share here:

that Pican-M is using the MCP2515, which has only two RX buffers.

That puts quite the requirement on speedy processing by the main CPU.

When designing hardware on our end, and needing an external CAN controller (as opposed to one inside a microcontroller, which almost always has more RX buffers than only 2); we quite quickly discarded the MCP2515 as an option.

And worked on/with the MCP2518 or 2519 instead. Much more buffers. And in case of N2K, especially the existence of fastpackets adds to the requirement of having sufficient RX buffers.

Supported in Linux kernel 5.10 and up if I’m not mistaken.

in other words: on my boat, I’d try to keep it free from MCP2515. Missing packets leads to the weirdest of issues, very hard to diagnose.

Sep 22 '23 18:09 mpvader

I don't argue about MCP2515 being less than stellar, but the impact isn't necessarily that huge. My boat server, based on a Raspberry Pi CM4 and Waveshare's isolated CAN HAT (with MCP2515), at the moment has been running for 8 days straight. CAN RX packet counter is at 82292123 (627.7 MiB) and the dropped packets counter is at 5227. That's one dropped packet every 2 minutes. Not nothing, but the packet loss rate is 5227/82292123, or 0.006%. Given the repetitive nature of N2K traffic, that's not a huge deal. But this is something where YMMV a lot.

Sep 22 '23 19:09 mairas

I agree that the RX buffers on the MCP2515 may be insufficient for some scenarios. In a large harbor like mine (Marina Del Rey) the N2K stream is flooded with AIS position updates.) It doesn't explain why the canboatjs fails silently as explained earlier in the thread but a more robust CAN receiver could only help matters.

I would happily test an MCP2518 with Signalk, if I could find a board for less than $20 that easily interfaced with an RPi.

Do you have any suggestions, @mpvader? I found this product but it's not clear from the docs how to interface it with an RPi.

Sep 22 '23 19:09 naugehyde

I would happily test an MCP2518 with Signalk, if I could find a board for less than $20 that easily interfaced with an RPi. Do you have any suggestions, @mpvader?

Sorry, I wouldn't know. I checked a bit, and can't find any for that price. There are some, under EUR 100. It looks like the low cost makers have not switched over (yet).

I don't argue about MCP2515 being less than stellar, but the impact isn't necessarily that huge.

Yes I agree completely on the YMMV and I can see that it will work in most cases. It will depend on business of the canbus vs CPU idle time. Note that not all can controllers have proper implementation of error and drop counters. I don't remember the details around that though. And while at it, the adapter/protocol you really don't want for a complex protocol like NMEA2000 is SLCAN.

Btw. we're looking at also from a view of having the network 100% reliable for changing configuration, and so forth. Not just repetitive data.

Sep 22 '23 20:09 mpvader

To add: an implementation that we also tested extensively, and found to be very reliable, is the candlelight project, which uses the gsusb driver between the CAN-CPU and the Linux-CPU.

Sep 22 '23 20:09 mpvader

Hi, our MacArthur HAT v1.2 is about to be released as stable and it has the MCP2518 tested and working: https://github.com/OpenMarine/MacArthur-HAT/tree/v1.1

All the prototypes are installed and working and I only have one on hand now, but we will soon go into production. Maybe we can leave you a prototype temporarily, we are in Barcelona and Washington, where are you?

Sep 25 '23 10:09 sailoog

I'd happily test a prototype if I had one.

I am in Los Angeles as it happens.

On Mon, Sep 25, 2023 at 3:12 AM Sailoog @.***> wrote:

Hi, our MacArthur HAT v1.2 is about to be released as stable and it has the MCP2518 tested and working: https://github.com/OpenMarine/MacArthur-HAT/tree/v1.1

All the prototypes are installed and working and I only have one on hand now, but we will soon go into production. Maybe we can leave you a prototype temporarily, we are in Barcelona and Washington, where are you?

— Reply to this email directly, view it on GitHub https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1733367360, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALTAIRFZDVGLBHODRHVWW53X4FKKLANCNFSM6AAAAAA43WHRS4 . You are receiving this because you were mentioned.Message ID: @.***>

Sep 25 '23 11:09 naugehyde

I'd happily test a prototype if I had one. I am in Los Angeles as it happens.

I can hook you up with a MacArthur HAT with the MCP2518. Drop me an email to adrian at adrianstuder dot com

Sep 26 '23 21:09 astuder

@naugehyde any news on this? Have you had the chance to test the minimal repro we talked about above?

Oct 05 '23 05:10 tkurki

@tkurki Currently testing with a MCP2518 equipped CAN hat courtesy @astuder. The MCP2518 has a much larger buffer space than the MCP2515 that's on the PICAN-M and the Waveshare CAN hat.

Just got the rig up and running today. Was not able to recreate with the MCP2518 hat but it's just day one and only had time for half an hour's testing.

Will see if I can recreate w/ the MCP2518 then get back to testing on a MCP2515 equipped hat.

It will be very interesting if I cannot recreate the bug on a MCP2518 equipped hat as the common denominator in my tests on two different boards (RPI 4b and a Le Potato) with two different CAN hats was the MCP2515 controller.

Oct 05 '23 05:10 naugehyde

@tkurki Should I test on 2.3.0 or 2.2.0? I note with interest that 2.3 has an upgraded canboatjs.

Oct 05 '23 22:10 naugehyde

Latest makes most sense I guess.

Oct 06 '23 04:10 tkurki