firmware icon indicating copy to clipboard operation
firmware copied to clipboard

If a packet is heard multiple times, rebroadcast using the highest hop limit

Open erayd opened this issue 1 year ago • 18 comments

Sometimes a packet will be in the TX queue waiting to be transmitted, when it is overheard being rebroadcast by another node with a higher remaining hop limit. When this occurs, modify the pending packet in the TX queue to avoid unnecessarily wasting hops.

This is intended to help solve the problem of a router node with excellent coverage (A) hearing a packet bounce around between several poorly-placed nodes while waiting for the opportunity to rebroadcast. On a busy mesh, this can mean that the first copy of the packet that (A) hears may have had its hop limit prematurely reduced, and a better version is subsequently received.

This is more relevant to the late-rebroadcast scenario implemented in https://github.com/meshtastic/firmware/pull/5528, but is still applicable even without that mode, particularly when having to contend with poorly-placed rooftop routers.

For example:

  1. (E) transmits a packet with hop limit 3
  2. (B) & (D) both overhear it. (D) starts rebroadcasting it first with hop limit 2, (B) overhears (D) and waits.
  3. (B) and (C) are both ready to transmit. (C) happens to go first, rebroadcasting with hop limit 1.
  4. (A) hears the packet from (C), and is about to rebroadcast with hop limit 0, when it hears (B) starting to send.
  5. (B) rebroadcasts the packet with hop limit 2.
  6. (A) hears the packet again from (B), this time with hop limit 2.
  7. (A) adjusts the hop limit on the packet in its TX queue from 0 to 1, and then transmits the packet.

example

erayd avatar Dec 09 '24 09:12 erayd

This makes sense to me

fifieldt avatar Dec 09 '24 09:12 fifieldt

Implementation looks good, but I'm not entirely sure about the idea behind it, as it effectively means you're artificially increasing the hop limit for certain branches of the mesh. Not sure this is necessarily a bad thing, as it wouldn't happen if the optimal route was chosen in the first place, but it does go against the idea of ​​limiting the amount of rebroadcasts by setting a proper hop limit.

Also, a small issue with this is that a traceroute result will show a route with more hops than the hops_away accompanying the packet, which is also used to add “unknown” nodes to the route.

GUVWAF avatar Dec 09 '24 13:12 GUVWAF

I'm not entirely sure about the idea behind it, as it effectively means you're artificially increasing the hop limit for certain branches of the mesh. Not sure this is necessarily a bad thing, as it wouldn't happen if the optimal route was chosen in the first place, but it does go against the idea of ​​limiting the amount of rebroadcasts by setting a proper hop limit.

The way I see it, is that it is correcting an unintended (and suboptimal) lowering of hop count for those sections. Two different ways of looking at the same situation perhaps? This wouldn't change the number of rebroadcasts in that area, but it would allow the packet to ultimately travel further, rather than getting lost between high sites due to routing shenanigans at the origin area.

Also, a small issue with this is that a traceroute result will show a route with more hops than the hops_away accompanying the packet, which is also used to add “unknown” nodes to the route.

Thanks for flagging this. I do not yet understand how traceroute is implemented (have yet to investigate that part of the source), so will dig into that to ensure it is treated properly.

erayd avatar Dec 09 '24 13:12 erayd

This wouldn't change the number of rebroadcasts in that area

For this specific packet, it does (otherwise this won't have any effect).

Two different ways of looking at the same situation perhaps?

I guess it depends on whether users are already accounting for this or not. Even with this solution, consuming hops too early might still happen if the packet already left the Tx queue, or if the first packet arrived with a hop limit of 0. So, if users set their hop limit based on the maximum they have seen, this only leads to more unnecessary rebroadcasts. Only if they set it exactly to the one needed for the optimal route, then this would indeed higher the success rate without any additional unnecessary rebroadcasts.

GUVWAF avatar Dec 09 '24 14:12 GUVWAF

I do not yet understand how traceroute is implemented

It will be called upon as a module, and then modifies the payload, before adding it to the Tx queue again. When you see a duplicate packet, it will be filtered out and thus the modules are not called for it, so the original traceroute result will be in the Tx queue. I've struggled with this before, since you need to be careful with handling duplicate packets multiple times, for example because it shouldn't be sent to the phone and MQTT twice.

GUVWAF avatar Dec 09 '24 14:12 GUVWAF

For this specific packet, it does (otherwise this won't have any effect).

Can you clarify? I'm not seeing where the extra rebroadcast is in the local area. The only additional rebroadcast should be out of the problem area, by whichever relaying device would otherwise have seen a hop limit of 0 (in this case, A).

The fundamental question is, do you see this behaviour as desirable or undesirable in the general case? If the latter, would you be more comfortable if I implemented this as part of #5528, to apply only to ROUTER_LATE?

I think it's definitely warranted in the ROUTER_LATE case, because those nodes are intended to be serving isolated pockets of nodes that do a poor job amongst themselves first. But with other roles, the usefulness is much more marginal.

erayd avatar Dec 09 '24 22:12 erayd

It might just be we’re not agreeing on what the “(local/problem) area” is. In my opinion it doesn’t really matter where the additional rebroadcasts happen, if they are unnecessary then it should be avoided. It is unnecessary when an end-node rebroadcasts even though there is no new receiver, but it is especially relevant for direct messages, where most nodes do not need to receive the packet at all. For example here, if node 0 sends a DM to node 1 with a hop limit of 2, it also goes to node 2 and 3, and ends up with a hop limit of 0 at node 4. However, when it arrives later via 5, with this PR node 4 would still rebroadcast.

image

do you see this behaviour as desirable or undesirable in the general case?

If users are conservative in setting the hop limit, then it would be desirable, but in my experience this is mostly not the case.

I think it's definitely warranted in the ROUTER_LATE case, because those nodes are intended to be serving isolated pockets of nodes that do a poor job amongst themselves first.

The ROUTER_LATE should serve isolated pocket of nodes that did not hear the packet yet. The first packet that arrives at the ROUTER_LATE should not be coming from those isolated pockets of nodes that consume too many hops.

But with other roles, the usefulness is much more marginal.

I agree this will happen more frequently if there are ROUTER_LATE roles in the mesh. However, it will then also affect ROUTERs and REPEATERs, because even when the route via a ROUTER_LATE uses less hops, its packet likely arrives later than a suboptimal route.

Maybe we can come to a compromise and only introduce this after ROUTER_LATE and the NextHopRouter (#2856) are introduced? With the NextHopRouter unnecessary relaying for DMs will already be limited, while using this PR, for broadcasts the reachability might in some cases be improved.

GUVWAF avatar Dec 10 '24 18:12 GUVWAF

It might just be we’re not agreeing on what the “(local/problem) area” is.

I think that may well be the case. I'm looking at this from the principle of "The packet should be spread as widely as possible across the mesh with the fewest number of hops". In that context, what I consider to be unnecessary rebroadcasts are those generated by clients who missed seeing a prior nearby repetition of the packet (e.g. due to terrain obstructions). Whereas you seem to be coming at it from the perspective of "Users are probably trying to reach only people very close to them, and packets should be discouraged from travelling longer distances." (unless I have badly misunderstood your example scenario in the above post).

In my opinion it doesn’t really matter where the additional rebroadcasts happen, if they are unnecessary then it should be avoided.

Agreed. However, currently we do not have a way to divine the user's intent - they might be trying to reach node 1, but they may also be wanting to reach node 6, or perhaps even something further away than that which would require relaying via (6). This is particularly the case with broadcast packets (e.g. if (6) and a closer node share a common channel, the user may desire them both to receive the packet). In the common channel case, the user may not even know how far away (6) is, or whether they're a member of that channel.

With DMs, it's more straightforward - the destination node ID is in the packet header. Albeit that we don't (yet) have a mechanism to determine where on the network that node might be located, and which hops may be necessary to reach it. Looking forward to #2856 for that!

do you see this behaviour as desirable or undesirable in the general case?

If users are conservative in setting the hop limit, then it would be desirable, but in my experience this is mostly not the case.

Is it not that users bump the hop limit up because they are struggling to reach their target otherwise? I know that is certainly the case at the edges of our local mesh here... distant users increase their hopcount, because otherwise their packets don't reliably make it through the mess of mountains & hills to the person 400km away that they're trying to communicate with.

I think it's definitely warranted in the ROUTER_LATE case, because those nodes are intended to be serving isolated pockets of nodes that do a poor job amongst themselves first.

The ROUTER_LATE should serve isolated pocket of nodes that did not hear the packet yet. The first packet that arrives at the ROUTER_LATE should not be coming from those isolated pockets of nodes that consume too many hops.

This makes no sense. ROUTER_LATE does not exist as a one-way thing to only deliver packets to those pockets of users. For it to properly serve those users, also exists to receive packets from them and relay those out to the rest of the mesh. In many cases, due to the topography, that node will be the only path available for such traffic to take.

But with other roles, the usefulness is much more marginal.

I agree this will happen more frequently if there are ROUTER_LATE roles in the mesh. However, it will then also affect ROUTERs and REPEATERs, because even when the route via a ROUTER_LATE uses less hops, its packet likely arrives later than a suboptimal route.

By the time the ROUTER_LATE version of a packet arrives, anything with a ROUTER or REPEATER role would have already rebroadcast that packet (and will therefore completely ignore the ROUTER_LATE version) unless the one from ROUTER_LATE is the first time it was received. This optimisation is only relevant to packets that are already sitting in the TX queue - it doesn't retain them there to wait for better options, and it doesn't rebroadcast it a second time. This behaviour is purely opportunistic. Hence why I say the benefits are more marginal for other roles, because with other roles things get flushed out of the TX queue much more rapidly. The sooner they move, the less this helps.

Maybe we can come to a compromise and only introduce this after ROUTER_LATE and the NextHopRouter (#2856) are introduced? With the NextHopRouter unnecessary relaying for DMs will already be limited, while using this PR, for broadcasts the reachability might in some cases be improved.

I'd be OK with that 🙂. It's a feature I'm keen to have available on our local mesh sooner than that - because it's clear from experience that we definitely do need it - but we can always just use custom binaries on our sites where that matters, so waiting until after #2856 and #5528 are merged won't cause us any problems.

erayd avatar Dec 11 '24 10:12 erayd

"Users are probably trying to reach only people very close to them, and packets should be discouraged from travelling longer distances." (unless I have badly misunderstood your example scenario in the above post).

My point is that the hop limit is a global setting, it applies to every packet, so you have to set it to the maximum you expect you need. However, it is also used for the case where you send a direct message to a node close to you.

ROUTER_LATE does not exist as a one-way thing to only deliver packets to those pockets of users. For it to properly serve those users, also exists to receive packets from them and relay those out to the rest of the mesh.

Ah, I was thinking about the use-case for a ROUTER_LATE on a rooftop that serve CLIENT_MUTE devices in the house, but yes, when they are normal CLIENTs and the packets go the other way around, you’re right.

a ROUTER or REPEATER role would have already rebroadcast that packet

Yes, if it is in front of the Tx queue, and there will not be any new packet received in between from another ROUTER/REPEATER or a node that didn’t hear the first packet (hidden node problem).

Anyway, let's merge this after ROUTER_LATE and the NextHopRouter (and hopefully after fixing the traceroute issue).

GUVWAF avatar Dec 11 '24 19:12 GUVWAF

Anyway, let's merge this after ROUTER_LATE and the NextHopRouter (and hopefully after fixing the traceroute issue).

Sounds good 🙂. And yes, I agree that the traceroute issue you've identified is something that I should resolve before this is merged.

erayd avatar Dec 11 '24 19:12 erayd

@caveman99 Please don't force-push my branches. No issue with you keeping it up to date if you want to, but unnecessarily breaking sync is irritating. Merge commits work just fine.

erayd avatar Dec 12 '24 17:12 erayd

@caveman99 Please don't force-push my branches. No issue with you keeping it up to date if you want to, but unnecessarily breaking sync is irritating. Merge commits work just fine.

This is our process, having one branch managed differently than the others is not really going to work.

garthvh avatar Dec 12 '24 17:12 garthvh

@garthvh

This is our process, having one branch managed differently than the others is not really going to work.

#5528 indicates otherwise, as you folks have been using merge commits there without issue. And on several other PR branches, including some of them by @caveman99. So using merge commits is clearly accepted practice here.

Why is doing things in a way that doesn't break sync OK for those branches, but not this one?

erayd avatar Dec 12 '24 17:12 erayd

Preferred is rebase, sometimes you can't rebase, then we use merge commits.

caveman99 avatar Dec 12 '24 18:12 caveman99

@caveman99

Preferred is rebase, sometimes you can't rebase, then we use merge commits.

Roger. Looking at the PR list though, I'm still seeing a fair bit of merge activity in cases where a rebase would have worked fine (including my initial example of #5528) - so it doesn't seem that this preference is being adhered to in practice across the board. If that is something you want to insist on though, then I will just put up with the annoyance as the price of engaging with this project.

Rebase-first-by-maintainers for regularly keeping every PR updated on a repo this busy does seem an odd way of doing things IMO. Given the frequency of updates, you'd be breaking things on a near-constant basis. If this preference is something you want to insist on, might I suggest that you mention it in the default notes when somebody creates a PR? Currently it just requests that editing by maintainers be enabled, but makes no mention that one's work might get repeatedly rebased without warning. That is a useful thing to know upfront, especially for PRs that may involve a notable number of distinct commits between pushes.

erayd avatar Dec 12 '24 19:12 erayd

@GUVWAF What do you think of the new approach as a way of resolving the traceroute issue?

It seemed a bit dirty IMO to put exceptions in there for traceroute, so now it simply reprocesses the packet if it arrives again with a better hop limit while still in the TX queue (instead of modifying the hop limit of the existing queued version).

erayd avatar Dec 28 '24 04:12 erayd

@erayd Unfortunately this won't work as it will lead to duplicate packets in the apps and on MQTT as I mentioned above. Even if it's still in the Tx queue, it has already been sent to the phone here: https://github.com/meshtastic/firmware/blob/b2808063d0488bcd2a2c3b17aa134ecf865513eb/src/modules/RoutingModule.cpp#L32 and to MQTT here: https://github.com/meshtastic/firmware/blob/b2808063d0488bcd2a2c3b17aa134ecf865513eb/src/mesh/Router.cpp#L280

GUVWAF avatar Dec 28 '24 08:12 GUVWAF

@GUVWAF Roger, and thanks. Will keep thinking...

erayd avatar Dec 28 '24 09:12 erayd

Hello, I opened a ticket recently that matches this implementation. https://github.com/meshtastic/firmware/issues/7572 Is there any development on-going on your side? I can try to help, but I've not quite understood the traceroute issue.

foch avatar Aug 22 '25 10:08 foch

Is there any development on-going on your side?

I had forgotten I had this one sitting there! Will try to pick it back up again soon.

I can try to help, but I've not quite understood the traceroute issue.

The issue is that a traceroute response arriving later with more hops left on it contains different data than the packet which is currently sitting in the queue. So it's not enough to simply modify the hop limit of the already queued packet (as we do with other packets), but rather the traceroute packet needs to be re-processed or modified accordingly.

erayd avatar Aug 25 '25 00:08 erayd

Summary

  • Rework hop-limit upgrade handling so routers swap in better packets without losing duplicate suppression.
  • Use packet history to save the best hop limit seen (This will increase memory 3 bytes/packet)
  • Improve TX queue tooling/logging to make it obvious when and why entries are replaced

Details

Router hop-limit flow

  • Refactored FloodingRouter::shouldFilterReceived() and NextHopRouter::shouldFilterReceived() to call wasSeenRecently() once, capture the new wasUpgraded flag, and run upgrade replacement before the duplicate early-return. This lets the routers reuse the upgraded frame while still filtering true dupes.
  • Gate the upgrade path on IS_ROUTER_ROLE() and require hop_limit > 0, then call removePendingTXPacket() with the new packet’s hop limit so only worse queued copies are removed. Log messages now report the threshold being applied.

Packet history tracking

  • Extended PacketHistory::wasSeenRecently() with an optional wasUpgraded out-parameter and persisted the maximum hop limit observed for each packet. Seeing a higher hop count sets wasUpgraded, suppresses the duplicate fast path, and keeps the higher value even if later retransmissions arrive with fewer hops.
  • Updated PacketRecord metadata and comments to reflect the new “highest hop limit”

TX queue utilities and logging

  • Added MeshPacketQueue::getPacketFromQueue() so queue lookups can reuse the same search logic, and simplified find() to delegate to it.
  • Marked RadioLibInterface::removePendingTXPacket() as an override and switched the diagnostic to log packet IDs in hex, matching other debugging output.

Testing

  • Manual log inspection while passing upgraded hop-limit traffic using a production node set to ROUTER_LATE.

Example logs

DEBUG | 15:55:41 190 [Router] Packet History - Hop limit upgrade: packet 0x6a2f0d5d from hop_limit=1 to hop_limit=2
DEBUG | 15:57:45 315 [Router] Packet History - Hop limit upgrade: packet 0x69051dd8 from hop_limit=0 to hop_limit=1
DEBUG | 15:57:48 318 [Router] Packet History - Hop limit upgrade: packet 0x8f299f10 from hop_limit=1 to hop_limit=2
DEBUG | 15:58:36 366 [Router] Packet History - Hop limit upgrade: packet 0x0424a99c from hop_limit=0 to hop_limit=2
DEBUG | 15:55:41 190 [Router] Dropping pending-TX packet 0x6a2f0d5d with hop limit 0
DEBUG | 15:57:48 318 [Router] Dropping pending-TX packet 0x8f299f10 with hop limit 0
DEBUG | 16:06:25 834 [Router] Dropping pending-TX packet 0x60a989c2 with hop limit 3

🤝 Attestations

  • [X] I have tested that my proposed changes behave as described.
  • [X] I have tested that my proposed changes do not cause any obvious regressions on the following devices:
    • [ ] Heltec (Lora32) V3
    • [ ] LilyGo T-Deck
    • [ ] LilyGo T-Beam
    • [ ] RAK WisBlock 4631
    • [ ] Seeed Studio T-1000E tracker card
    • [X] Other (please specify below) Station G2

h3lix1 avatar Sep 18 '25 23:09 h3lix1

@GUVWAF Can you assign this to me so I can submit for review? Thank you.

h3lix1 avatar Sep 19 '25 00:09 h3lix1

@GUVWAF Can you assign this to me so I can submit for review? Thank you.

Sadly github doesn't allow assignments to people not in the organization

thebentern avatar Sep 19 '25 00:09 thebentern

@GUVWAF Can you assign this to me so I can submit for review? Thank you.

I have hit the review button. Can't assign it, but I can certainly click that on your behalf 🙂

erayd avatar Sep 19 '25 00:09 erayd

Ignore the force-commit sorry. I have force-reversed it, and all should be well again.

Was forcing a different branch for a different feature, and accidentally pushed my old work over the top of this one at the same time.

erayd avatar Sep 19 '25 10:09 erayd

...twice! 😭

Have fixed it again, and updated my local copy of it to stop that happening a third time. Apologies for the notification spam.

erayd avatar Sep 19 '25 10:09 erayd

@h3lix1 Now we're back to the original reason why this PR wasn't merged yet. Either we return false in wasSeenRecently() such that all logic gets redone, but that has the issue that it also leads to duplicates at the app. With the current approach, it is simply rebroadcasted, but then it is not handled by modules like TraceRoute.

I feel like we might want to use TransportMechanism to denote this is a duplicate packet and avoid sending it twice to the client, but I'm happy to hear other suggestions.

GUVWAF avatar Sep 20 '25 11:09 GUVWAF

@GUVWAF I went a different route, but one that I like better. This new route will re-run all the modules, except not send to the phone. After looking at it more, traceroute is only one of the things that might want to be updated with the updated information. Nodeinfo should also have the latest/best information as well.

All the modules (thankfully) handle this very well. Most key off of (from, id) which stays the same, and will just do an in-place replacement or StoreForward that has its own history queue.

Hopefully this is an acceptable alternate path.

h3lix1 avatar Sep 20 '25 17:09 h3lix1

Sorry for keeping you busy all the time ;) This looks quite good, but it’s not only sendToPhone() that's problematic. Calling the modules again also triggers things like the TextMessageModule, which also triggers e.g. the InkHUD that might display duplicate texts, for example.

I’m wondering if we really should only re-do traceroute, as I indeed can also not think of any other module that needs the double treatment. I looked at which modules implement alterReceivedProtobuf() and for the others they either use that for when receiving packets from the phone (Position, ATAK), or it will remain unchanged (NeighborInfo).

GUVWAF avatar Sep 20 '25 17:09 GUVWAF

@GUVWAF You're doing fantastic at finding all the cases I overlooked. I owe you a drink sometime. Maybe two.

Refactored to only handle the traceroute and nodedb update cases, since I really want the nodeDB to show the best number of hops. Agreed this is likely the better route, although elegance is a little lacking in nexthoprouter and floodingrouter's shouldFilterReceived ...

h3lix1 avatar Sep 20 '25 21:09 h3lix1