[Bug]: Own sent message is received again about 20 minutes later
Category
Other
Hardware
Rak4631
Is this bug report about any UI component firmware like InkHUD or Meshtastic UI (MUI)?
- [ ] Meshtastic UI aka MUI colorTFT
- [ ] InkHUD ePaper
- [ ] OLED slide UI on any display
Firmware Version
2.7.15.567b8ea Alpha
Description
Hello
I have my RAK4631 node set as client_mute and using firmware 2.7.15.
I send out a message, replying to another message that came in on channel 0.
It shows up in the app; so far, so good, nothing suspicious. Debug log of my sent message:
to: 4294967295 (!ffffffff)
id: 4080155345
want_ack: true
decoded {
portnum: TEXT_MESSAGE_APP
payload: "xxxxxxxxx 3 hops. Gr\303\274ess xxxx"
reply_id: 3756052286
}
Decoded Payload:
{
xxxxxxxxx 3 hops. Grüess xxxx
}
34 minutes later, I get my own message again in the app, with the same id, this time indicating that it has been traveling around the mesh and is now returning to me. Coming from me, yes, same message id, same reply_id. Now, maybe this is important: the node mentioned as relay_node runs firmware 2.6.11:
from: 1306336680 (!xxxxxxda8)
to: 4294967295 (!ffffffff)
id: 4080155345
rx_time: 1764358238
rx_snr: 2.75
rx_rssi: -105
hop_start: 5
relay_node: xxxxxxxxxxxch (!xxxxxx697)
transport_mechanism: TRANSPORT_LORA
decoded {
portnum: TEXT_MESSAGE_APP
payload: "xxxxxxxxx 3 hops. Gr\303\274ess xxxx"
reply_id: 3756052286
bitfield: 0
}
Decoded Payload:
{
xxxxxxxxx 3 hops. Grüess xxxx
}
@GUVWAF I remember that in FW 2.7.13, you merged stuff into master that made duplicates disappear. https://github.com/meshtastic/firmware/pull/8450
Here now in 2.7.15, the behavior happens again. I am unsure if it is related.
@h3lix1 I think you also worked on some duplicate removal stuff in 2.7.12: https://github.com/meshtastic/firmware/pull/8148/files
Is this maybe another case of the master branch missing important changes? It does not look like it, though. Compared to https://github.com/meshtastic/firmware/issues/8449, the behavior this time is different: back in 2.7.12 (fixed in 2.7.13), it was other people's messages I received twice; this time in 2.7.15, it is my own sent message that comes back at me from the mesh.
Relevant log output
Is your mesh quite busy? This might actually be the PacketHistory (which is used to filter duplicates) being full. If the entry got overwritten within those 20 minutes while the packet was still going around, the packet would no longer be ignored. In that case there should be clues in the serial logs: https://github.com/meshtastic/firmware/blob/de26dfe46837367addb35bd163363179d62a55f3/src/mesh/PacketHistory.cpp#L308-L309
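For illustration only, here is a heavily simplified sketch of such a fixed-size duplicate filter. It is not the actual PacketHistory implementation, and the names and slot-reuse policy are assumptions; it just shows why a full table can let a very old packet be treated as "new" again once its original entry has been recycled:

```cpp
// Illustrative sketch, NOT the real PacketHistory code.
#include <array>
#include <cstdint>

struct Seen { uint32_t from = 0, id = 0, rxTimeMs = 0; };

template <size_t N> class TinyPacketHistory {
    std::array<Seen, N> slots{};
  public:
    // Returns true if (from, id) was already seen; otherwise records it,
    // reusing the oldest slot when the table is full ("TRACE OLDEST SLOT").
    bool wasSeenRecently(uint32_t from, uint32_t id, uint32_t nowMs)
    {
        size_t oldest = 0;
        for (size_t i = 0; i < N; i++) {
            if (slots[i].from == from && slots[i].id == id)
                return true;                        // duplicate -> ignore it
            if (slots[i].rxTimeMs < slots[oldest].rxTimeMs)
                oldest = i;
        }
        slots[oldest] = {from, id, nowMs};          // old entry gets overwritten here
        return false;                               // a late copy now looks like a new packet
    }
};
```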
Is your mesh quite busy?
@GUVWAF Yes, we had our French-speaking cousins from Western Switzerland join the mesh, moving from Long_Fast to our Medium_Fast. 618 nodes total, 181 online, on MEDIUM_FAST. We are trying to educate our meshers in Western Switzerland a) not to use LoRa maxHops 7, b) to increase telemetry intervals to 12 h or more, and c) not to use Smart Position.
Channel utilization on the mountain nodes -- almost all of them Router_Late, except for the Router node that relayed the duplicate message to me -- is below 30%, mostly hovering around 25%, which I think is fine.
I just connected my node to USB and looked at the serial log. Of course, I cannot look back to the point in time when the error occurred, but even now I see some "Reusing slot aged" messages in the debug log, accessed via flasher.meshtastic.org and the serial log button / USB port:
hopStart=7 relay=0x18)
DEBUG | 20:18:26 283094 [RadioIf] Packet RX: 272ms
INFO | 20:18:26 283094 [Router] Packet History - insert: Reusing slot aged 8.152s TRACE MATCHED PACKET
DEBUG | 20:18:26 283094 [Router] Ignore dupe incoming msg (id=0x1515f2fe fr=0xea341d68 to=0x113ba723, transport = 1, WantAck=0, HopLim=1 Ch=0x1f encrypted len=91 rxSNR=-8.75 rxRSSI=-112 hopSt
DEBUG | 20:18:26 283094 [Router] cancelSending id=0x1515f2fe, removed=0
DEBUG | 20:18:26 283094 [Router] Incoming msg was filtered from 0xea341d68
DEBUG | 20:18:26 283094 [RadioIf] Corrected frequency offset: -143.375000
DEBUG | 20:18:26 283094 [RadioIf] Lora RX (id=0x0729a791 fr=0x9f7ed214 to=0xffffffff, transport = 0, WantAck=0, HopLim=2 Ch=0x1f encrypted len=54 rxSNR=-9 rxRSSI=-112 hopStart=5 relay=0x18)
DEBUG | 20:18:26 283094 [RadioIf] Packet RX: 190ms
DEBUG | 20:18:26 283094 [Router] Packet History - Hop limit upgrade: packet 0x0729a791 from hop_limit=1 to hop_limit=2
INFO | 20:18:26 283094 [Router] Packet History - insert: Reusing slot aged 3.190s TRACE MATCHED PACKET
DEBUG | 20:18:26 283094 [Router] Ignore dupe incoming msg (id=0x0729a791 fr=0x9f7ed214 to=0xffffffff, transport = 1, WantAck=0, HopLim=2 Ch=0x1f encrypted len=54 rxSNR=-9 rxRSSI=-112 hopStart
DEBUG | 20:18:26 283094 [Router] cancelSending id=0x729a791, removed=0
DEBUG | 20:18:26 283094 [Router] Incoming msg was filtered from 0x9f7ed214
DEBUG | 20:18:28 283096 [RadioIf] Corrected frequency offset: -406.875000
DEBUG | 20:18:28 283096 [RadioIf] Lora RX (id=0x6e9d4df5 fr=0xcf22b193 to=0xffffffff, transport = 0, WantAck=0, HopLim=3 Ch=0x1f encrypted len=50 rxSNR=1.75 rxRSSI=-101 hopStart=5 relay=0x97)
DEBUG | 20:18:28 283096 [RadioIf] Packet RX: 180ms
INFO | 20:18:28 283096 [Router] Packet History - insert: Reusing slot aged 998.430s TRACE OLDEST SLOT
I also see a "No suitable channel found for decoding" notice at times. Does that have an impact?
INFO | 20:18:42 283111 [Router] Packet History - insert: Reusing slot aged 1008.610s TRACE OLDEST SLOT
WARN | 20:18:42 283111 [Router] No suitable channel found for decoding, hash was 0xaa!
DEBUG | 20:18:42 283111 [Router] packet decoding failed or skipped (no PSK?) (id=0xb372bb02 fr=0xdc9b4cf5 to=0xffffffff, transport = 1, WantAck=0, HopLim=3 Ch=0xaa encrypted len=39 rxtime=176
DEBUG | 20:18:42 283111 [Router] Module 'routing' wantsPacket=1
INFO | 20:18:42 283111 [Router] Received routing from=0xdc9b4cf5, id=0xb372bb02, portnum=23, payloadlen=49472
DEBUG | 20:18:42 283111 [Router] Module 'routing' considered
DEBUG | 20:18:43 283111 [RadioIf] Corrected frequency offset: -406.875000
DEBUG | 20:18:43 283111 [RadioIf] Lora RX (id=0xb372bb02 fr=0xdc9b4cf5 to=0xffffffff, transport = 0, WantAck=0, HopLim=2 Ch=0xaa encrypted len=39 rxSNR=1.5 rxRSSI=-102 hopStart=5 relay=0x97)
DEBUG | 20:18:43 283111 [RadioIf] Packet RX: 150ms
INFO | 20:18:43 283111 [Router] Packet History - insert: Reusing slot aged 0.830s TRACE MATCHED PACKET
DEBUG | 20:18:43 283111 [Router] Ignore dupe incoming msg (id=0xb372bb02 fr=0xdc9b4cf5 to=0xffffffff, transport = 1, WantAck=0, HopLim=2 Ch=0xaa encrypted len=39 rxSNR=1.5 rxRSSI=-102 hopStar
DEBUG | 20:18:43 283111 [Router] cancelSending id=0xb372bb02, removed=0
DEBUG | 20:18:43 283111 [Router] Incoming msg was filtered from 0xdc9b4cf5
DEBUG | 20:18:45 283113 [RadioIf] Corrected frequency offset: 267.375000
DEBUG | 20:18:45 283113 [RadioIf] Lora RX (id=0xb372bb02 fr=0xdc9b4cf5 to=0xffffffff, transport = 0, WantAck=0, HopLim=2 Ch=0xaa encrypted len=39 rxSNR=-13.25 rxRSSI=-116 hopStart=5 relay=0xf)
DEBUG | 20:18:45 283113 [RadioIf] Packet RX: 150ms
INFO | 20:18:45 283113 [Router] Packet History - insert: Reusing slot aged 1.735s TRACE MATCHED PACKET
I wonder if maybe we are seeing an unwanted side effect of the 0-cost-hops feature. Could it be that when infra nodes of type Client_Base, Router, or Router_Late do not decrement hop counts for packets coming from their favorites, the volume of messages explodes and they stay around for too long?
I think most of our mountain nodes are still on 2.6.11, where accidental favoriting on infra nodes / 0-cost hops is not an issue, but I don't know what firmware the infra nodes in Western Switzerland are running.
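To show how I understand the feature, here is some pseudologic with made-up types and names; it is not the actual Meshtastic router code, so please correct me if this is wrong:

```cpp
// Assumed relaying rule as described in this thread, illustrative only.
#include <cstdint>

enum class Role { CLIENT, CLIENT_MUTE, CLIENT_BASE, ROUTER, ROUTER_LATE };

struct Packet { uint8_t hop_limit; };

bool isInfraRole(Role r)
{
    return r == Role::ROUTER || r == Role::ROUTER_LATE || r == Role::CLIENT_BASE;
}

// Returns true if the packet should be rebroadcast; hop_limit is updated in place.
bool prepareRelay(Packet &p, Role myRole, bool senderIsFavorite)
{
    if (p.hop_limit == 0)
        return false;                // out of hops: drop instead of relaying
    if (isInfraRole(myRole) && senderIsFavorite)
        return true;                 // "0-cost hop": hop_limit is not decremented
    p.hop_limit -= 1;                // a normal relay costs one hop
    return true;
}
```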
Thanks for your detailed answer.
This was the most interesting line:
INFO | 20:18:28 283096 [Router] Packet History - insert: Reusing slot aged 998.430s TRACE OLDEST SLOT
So indeed, after around 17 minutes the packet history already rolls over, while the packet is still traveling around the mesh. The nRF52 has the least amount of RAM, so we keep the PacketHistory size relatively low, and apparently in some cases it's too low.
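As a back-of-envelope estimate, purely with assumed numbers (neither value below is a real firmware constant), of how many entries would be needed to cover the observed packet lifetime:

```cpp
// To keep recognizing a packet as a duplicate for its whole lifetime, the
// history needs roughly rxRate * lifetime entries. Assumed values only.
constexpr float rxPacketsPerSecond = 1.0f;          // guess for a busy mesh like this one
constexpr float packetLifetimeSec  = 20.0f * 60.0f; // a copy came back after ~20 minutes
constexpr int   slotsNeeded = int(rxPacketsPerSecond * packetLifetimeSec); // ~1200 slots
// The "Reusing slot aged 998s" line above suggests the current table only covers
// about 17 minutes of traffic, so anything circulating longer slips past the filter.
```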
@Marek-mk How do you feel about this? Do you think we should increase the maximum size?
I also see a "No suitable channel found for decoding" notice at times. Does that have an impact?
It means some people are using private channels. It has the same impact on the PacketHistory as other packets.
I wonder if maybe we are seeing an unwanted side effect of the 0-cost-hops feature.
Yes, this definitely has an impact on how long the packet travels on the mesh.
I wonder if maybe we are seeing an unwanted side effect of the 0-cost-hops feature.
Yes, this definitely has an impact on how long the packet travels on the mesh.
Oh man, this will ruin the mesh here ... does no one else see the bad impact of abusing a field like is_favorite for such critical functionality as
a) router-like prioritized sending-out from Client_Base nodes
b) tagging source nodes whose traffic is relayed at 0 cost, i.e. without decrementing the hop count?
a and b have huge side effects in a busy, large mesh.
It means some people are using private channels. It has the same impact on the PacketHistory as other packets.
Yeah, I and others are starting to make people aware of the fact that on EU_868 we are all on the same single frequency slot. We might have to move our mountain nodes across the country from rebroadcast strategy ALL to LOCAL_ONLY. That way, people in their regions can still use their own private channels, but those will no longer contribute to message volume across the country.
I know this is only indirectly related, but 0-cost-hop routing for packets coming from favorites should either be gated to the really short LoRa presets, or at least the maximum number of favorites on infra nodes of type Client_Base, Router, and Router_Late should be limited, to e.g. 2. @h3lix1 what do you think? Here is a view of our current MEDIUM_FAST Swiss mesh, by the way: 620 nodes total from my perspective, of which 180 online.
We absolutely love Meshtastic, and we are learning a lot along the way. Hopefully, our inputs make some sense. I am neither against 0-cost hops nor Client_Base, but it seems to me the side effects in a relatively loosely-managed mesh like a countrywide one have not been considered.
@Marek-mk How do you feel about this? Do you think we should increase the maximum size?
Whoa. This seems to be a big network with busy channel utilization. We switched to ShortFast here some months ago, because we were getting delays, losses, and channel utilization over 30%. Now we have over 600 nodes total, half of them 'online' (seen in the last hour?), and channel utilization < 20% on routers. I also noticed router_late causing unnecessary delays of my own packets and hitting the transmit limit quickly, so I decided not to use it. client_base now seems like a good feature, but I have not yet tested it.
...but a message reappearing after ~20 minutes seems weird. Either someone is replaying the packet to the network for 'fun', or, more likely, given how author @shalberd described the network, a loop is created.
I can imagine an MF network consisting of many router_late nodes being the cause: a packet gets held by a node that cannot transmit for a long time because of the huge traffic (channel utilization) and the airtime TX limit, and is only released to the network after a long time, so the packet travels for a long time and reappears at the source.
We could try to increase the size of PacketHistory, but I think that would just be working around a bad network configuration (MF, many router_late nodes) at the cost of memory usage, without resolving the source of these late ghost packets. And such a ghost packet eventually dies out and does not create an avalanche, right?
does no one else see the bad impact of abusing a field like is_favorite for such critical functionality
Yes, I agree we'd need a dedicated flag for this, but let's discuss this in #8555.
We could try to increase the size of PacketHistory, but I think that would just be working around a bad network configuration
I tend to agree indeed.
We've had this situation in BayMesh before 0-hop routing was deployed, but it was mostly around people downlinking MQTT messages. It would be interesting to see if this was due to MQTT. The second message has a "( ..." at the end of the long owner name, which might be an (MQTT) tag, but I can't tell from the image.
If there is a way to trace the message via MQTT loggers (including flags), that would be a great benefit here. I don't think this is due to overflowing the queue, and I'd be very impressed if a message coming back 30 minutes later could be blamed on 0-hop routing.
How many hops do you think a packet would have to go through for 0-hop routing to possibly be the cause here? For 30 minutes? Even with all routers having 0-hop routing enabled? Not a chance. None. Even if one or two routers have 'accidentally' favorited their neighbor router, the hops will run out quickly enough.
I'm starting to get annoyed that every gremlin in the system is getting blamed on 0-hop routing when any bad actor (or bug) can simply replay a message to cause this to happen. Every message in BayMesh makes it through the network within 10 seconds to over 700 nodes spanning over 600km, and we extensively use 0-hop meshing with no difficulty.
I'm starting to get annoyed that every gremlin in the system is getting blamed on 0-hop routing when any bad actor (or bug) can simply replay a message to cause this to happen. Every message in BayMesh makes it through the network within 10 seconds to over 700 nodes spanning over 600km
I agree 0-hop routing is not that big of a factor here, but the fact that one of the packets makes it through in 10 seconds definitely does not rule out that duplicates can be floating around for much longer than that. Each hop adds delay, and a hard hop limit only limits the chance that a packet circles back via another route; it does not rule it out.
ROUTER_LATE also plays a big role here, as it pushes packets to the back of the queue. If it doesn't get a chance to send the others first because of high channel utilization, the packet gets delayed again and again, and with a slow preset this quickly adds up.
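A rough, worked example of how that can add up; only the ~270 ms airtime comes from the "Packet RX: 272ms" log line above, and everything else is an assumption, not a firmware constant:

```cpp
// Illustrative numbers only, not firmware constants.
constexpr float airtimeSec          = 0.27f; // ~91-byte encrypted text at MEDIUM_FAST, per log
constexpr float rebroadcastWaitSec  = 5.0f;  // assumed contention/backoff wait per relay
constexpr float routerLateHoldSec   = 30.0f; // assumed extra hold when a ROUTER_LATE sits
                                             // behind a full queue / airtime TX limit
constexpr int   hops                = 7;     // maxHops 7 is still common on this mesh
constexpr float perCopyWorstCaseSec = hops * (airtimeSec + rebroadcastWaitSec + routerLateHoldSec);
// ~247 s, i.e. about 4 minutes for a single badly delayed copy -- and several
// delayed copies taking different routes can stretch the total much further.
```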
It would be good indeed to rule out other factors like MQTT, but we can't deny that we need to be careful with loops in general.
duplicates can be floating around for much longer than that
Hello @GUVWAF, you are right: a high density of ROUTER_LATE and ROUTER nodes, especially in very good locations in an area, leads to duplicates whirling around on the mesh. That makes sense because, after all, both of those roles ALWAYS retransmit ANY message / packet they can get hold of.
We will try, in the Lake Lucerne and "great plains" area of Switzerland, to reduce the number of ROUTER_LATE nodes by changing some of them to the CLIENT role. That should markedly reduce duplicates whizzing around.
@h3lix1 no hard feelings regarding 0-cost hops for marked favorites. Even we here in Europe see the appeal and the need. However, as others have probably mentioned before, we have a very different frequency slot situation over here (unfortunately just the one frequency slot on EU_868), so we need to be careful with that. In general, I also feel that
- targeted direct-message NextHopRouting vs.
- trying to spread messages out as far as possible
are a bit contradictory. Maybe something for Meshtastic 3.0, to think of better ways. One thing is for sure: you at Meshtastic are doing a great job, have inspired us here, and have in a way become a victim of your own success. A technology that was designed for 50 hikers in a medium-sized area is now used country-wide. What a testimony to the awesomeness of LoRa. Best regards from all of us here in Switzerland, to the U.S. as well as to Belgium and Australia / NZ.
@shalberd I guess the problem I have is understanding the situation that would cause duplicates to whirl around the mesh. A lot of what is mentioned contradicts my experience with Meshtastic.
I get the feeling 0-hop can be completely disabled and you will still see the problems you are experiencing, but it is very difficult to know without more observability.
Do all the local nodes log to a single mqtt server? That might help with finding the source of your loops.
As for range, you are correct. I guess my philosophy is that to avoid one-way conversations on public chats, the only solution is to try to reach everyone possible. Using hop limits to artificially limit distance seems to cause discombobulation for people in-between who can hear one recipient but not the other.
I am very interested in the unique nature of the mesh there to better understand the challenges.
@h3lix1 just briefly before I go to bed here, will answer more later as well:
I get the feeling 0-hop can be completely disabled and you will still see the problems you are experiencing
Absolutely; even leaving 0-hop aside, our problems with too many duplicate packets whirling around are due to the nature of ROUTER and ROUTER_LATE, which always retransmit. Therefore, we will try to reduce the number of ROUTER and ROUTER_LATE nodes here. That should reduce duplicate packets in any area, to a certain extent.
Why do we have duplicates? Mainly because we do not have north-south or east-west valleys, but mountain nodes with excellent RF propagation in all directions: north, east, west. Our only natural barrier to the south is the Alps, where a lot of nodes are positioned in high places. For example, but not limited to (scroll down a bit on that page): https://meshtastic.web-sdr.ch/
So it is only natural that, in an area of say 250 square miles with 4-7 ROUTER/ROUTER_LATE nodes, there will be duplicates of the same packet (by ID) whirling around.
Because CLIENT checks whether any other node has already retransmitted a packet and, if so, does not retransmit it itself, CLIENT should help in our dense scenario.
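Roughly like this, as I understand it (a simplified sketch with invented names, not the actual Router code):

```cpp
// Sketch of the "don't retransmit what someone else already relayed" idea.
// Presumably the "cancelSending ... removed=0" lines in my log are this kind of
// check: my client_mute node had nothing queued, so there was nothing to cancel.
#include <algorithm>
#include <cstdint>
#include <vector>

struct PendingTx { uint32_t id; };
std::vector<PendingTx> txQueue; // packets waiting to be rebroadcast

// Called when a copy of packet `id` is heard from another node.
bool cancelSending(uint32_t id)
{
    auto newEnd = std::remove_if(txQueue.begin(), txQueue.end(),
                                 [id](const PendingTx &p) { return p.id == id; });
    bool removed = (newEnd != txQueue.end()); // true if a queued rebroadcast was dropped
    txQueue.erase(newEnd, txQueue.end());
    return removed;
}
```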
Do all the local nodes log to a single mqtt server? That might help with finding the source of your loops.
No, we do not use MQTT uplink for anything but position info, I think. And we try to avoid MQTT downlink entirely.
the only solution is to try to reach everyone possible. Using hop limits to artificially limit distance seems to cause discombobulation for people in-between who can hear one recipient but not the other.
yeah, I understand ya, hear ya. We have sort of come to accept that. In emergency comms, important stuff will be copied and transmitted once more by people at the edges. That is our hope, instead of increasing any given hop's range a lot.
Thank you for your response.
Because CLIENT checks whether any other node has already retransmitted a packet and, if so, does not retransmit it itself, CLIENT should help in our dense scenario.
True - although ROUTER also makes sure it only sends the message once, unless there is a big delay somewhere... which is the part I'm trying to understand: how could these delays happen?
True - although ROUTER also makes sure it only sends the message once, unless there is a big delay somewhere... which is the part I'm trying to understand: how could these delays happen?
I bet on ROUTER_LATEs releasing some old packet once the channel is finally clear and the TX limit allows it...
[offtopic] As I observe our network, CLIENT_BASE seems to do the trick of ensuring that all in-house/in-car portable CLIENT_MUTEs receive messages. I vote, however, for separate flags and finally not using 'favorite' for it, but another binary flag with the function of 'always relay to this flagged node' and 'always relay from this flagged node'. One byte can store 8 flags, so there is no real overhead here.
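Something like this, just as an illustration (the flag names are hypothetical, not an existing Meshtastic field):

```cpp
// "One byte can hold 8 flags" -- hypothetical flag names, illustrative only.
#include <cstdint>

enum NodeFlags : uint8_t {
    FLAG_ALWAYS_RELAY_TO   = 1 << 0, // "always relay to this flagged node"
    FLAG_ALWAYS_RELAY_FROM = 1 << 1, // "always relay from this flagged node"
    // six more bits remain free in the same byte
};

uint8_t flags = FLAG_ALWAYS_RELAY_FROM;                  // set one flag
bool relayFrom = (flags & FLAG_ALWAYS_RELAY_FROM) != 0;  // test a flag
```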
although ROUTER also makes sure it only sends the message once
That never worked in our earlier scenarios. All blue ROUTERs this summer sent at the same time, and traceroutes went from ROUTER to ROUTER in a kind of loop when the whole midland / plains up north were open. That led to the red and green clients, which could see each other (local line of sight was verified), not communicating with each other: due to hop consumption by the routers, packets were never delivered and no contact was possible. This was the situation before reconfiguring the blue high-mountain nodes to Router_Late:
Thus, at least with only one frequency slot for all presets here in Europe on EU_868, we have come to realize that it is much better to let Clients fill in the gaps, with Router_Late only for key mountain nodes in Schwyz and Nidwalden, and to leave the nodes in Zug and toward Lake Zurich, at about 900 m altitude, in Client mode.
[offtopic] As I observe our network, CLIENT_BASE seems to do the trick of ensuring that all in-house/in-car portable CLIENT_MUTEs receive messages. I vote, however, for separate flags and finally not using 'favorite' for it, but another binary flag with the function of 'always relay to this flagged node' and 'always relay from this flagged node'. One byte can store 8 flags, so there is no real overhead here.
Yes, I agree: separate flags for CLIENT_BASE special-behavior nodes ... and also for 0-cost-hop FROM-nodes ... would be better than using is_favorite.
Also, since we started recommending CLIENT_MUTE as the default for nodes in our dense network, CLIENT only for well-positioned roof or hill nodes, and with our high-mountain ROUTER_LATE nodes, we seldom have the scenario anymore where we "absolutely cannot get out", which was the reason for CLIENT_BASE in the first place ...