firmware Next Hop-based routing with fallback to managed flooding

Description

This adds a NextHopRouter for direct messages, which only relays if it is the next hop for a packet. The next hop is set by the current relayer of a packet, which bases this on information from a previous successful delivery to the destination via flooding. Namely, in the PacketHistory, we keep track of (up to 3) relayers of a packet. When the ACK is delivered back to us via a node that also relayed the original packet, we use that node as next hop for the destination from then on. This makes sure that only when there’s a two-way connection, we assign a next hop. Both the ReliableRouter and NextHopRouter will do retransmissions (the NextHopRouter only 1 time). For the final retry, if no one actually relayed the packet, it will reset the next hop in order to fall back to the FloodingRouter again. Note that thus also intermediate hops will do a single retransmission if the intended next-hop didn’t relay, in order to fix changes in the middle of the route.

It is backwards compatible with all 2.x versions of Meshtastic, because for those nodes it will always fall back to flooding.

Implementation details

The next_hop and relay_node are added to the unencrypted PacketHeader. Since we only used 14 bytes out of the 16-byte header (it got this size due to memory alignment), the two unused bytes are used to save the last byte of the NodeNum of the next hop and current relayer. When hop_start was added (version 2.3), these bytes were set to 0, so we can use these bytes safely when hop_start is set. The re-use of header space means that there is no additional over-the-air overhead, except that up to 2 retransmissions are needed when a change of next hop occurs. So, the benefit will be most pronounce for rather static meshes, but since the next hop is only set after a successful two-way connection is set up, and it falls back to flooding rather quickly, even for dynamic meshes there is likely a benefit to using this.

In terms of memory overhead, the next_hop has to be stored in the NodeDB (only 1 byte), and there are 4 additional bytes required per packet in the PacketHistory to store the next_hop and three relayers.

Using only 1 byte for the next_hop means that there is a chance that the last byte of two nodes match, and then they will both try to relay. However, that’s not a big issue as that would be similar to flooding. The chance that only the intended next hop relays depends on the amount of nodes that can hear the packet. If that are 10 nodes (that would be a lot), the chance that only the next hop relays is 83.7% ((255/256*254/256*253/256 ... 247/256*)*100), for 5 nodes the chance is 96.1% ((255/256*254/256*253/256*252/256)*100).

Examples

With this, you get rid of the unnecessary rebroadcasts caused by flooding like the one from node 0 as shown below. (Note that an arrow to a node means that it received it (so there might be multiple arrows for one packet), but not necessarily that it was addressed.) With the NextHopRouter, node 0 doesn't try to relay:

Furthermore, it solves this issue of the current implementation where the wrong node (1, because it has the lowest SNR) is relaying: With the NextHopRouter, due to randomness in the order of rebroadcasting, at some point the route will succeed and 2 will be set as next hop from then on.

Notes for reviewers

NextHopRouter inherits from the FloodingRouter. Since it also requires retransmissions, this logic is now moved from the ReliableRouter to the NextHopRouter, and the ReliableRouter inherits from the NextHopRouter. Also, since the Router and NextHopRouter need to have access to the PacketHistory as well, the Router now inherits from it instead of only the FloodingRouter.

Oct 02 '23 17:10 GUVWAF

I have started looking. May take me some to to finish. I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it.

Oct 02 '23 19:10 loodydo

🤖 Pull request artifacts

file	commit
`pr2856-firmware-2.2.10.44dc270.zip`	44dc270c8a72d6af0ebcf55bcfcb060638f691ef

Oct 03 '23 13:10 github-actions[bot]

I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it.

It depends on the amount of nodes that can hear the packet (so the amount of immediate neighbors). But indeed, the chance that then only the intended next hop relays is a bit different than I presented before. I've updated the description with two calculations.

Oct 03 '23 17:10 GUVWAF

I pushed a new commit to resolve @loodydo’s comments. Also added the hop limit setting of the original transmitter to the header flags (using 3 of the 4 currently unused bits), such that we can determine the amount of hops the packet has already traveled. This comes in handy to set the next hop for immediate neighbors, and I think it’s also useful for displaying in the apps. Currently it will always show the SNR/RSSI of nodes in the list, but this is actually the SNR/RSSI to the last relayer. In cases that a node is not an immediate neighbor, we could display the hops towards that node.

I also changed the way how the next_hop and relay_node are stored in the MeshPacket and NodeDB. Now only the last byte is stored, since only that byte is sent over the air. Besides, it mitigates the case where you might assign the wrong NodeNum if you happen to have multiple nodes with the same last byte in the NodeDB.

Oct 13 '23 19:10 GUVWAF

I think this PR is ready to merge, except that it’s a breaking change unfortunately. I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message. We’ll have to wait for 3.0 with this.

Nov 26 '23 22:11 GUVWAF

If you use 1 byte, then I suggest modulo 251 of the whole name represented by UINT_64. The probability is then 0.4%

May 07 '24 20:05 roha-github

@roha-github What do you mean with "whole name represented by UINT_64" and which probability are you referring to?

May 07 '24 20:05 GUVWAF

@GUVWAF

"I used the 2 unused bytes to save the last byte of the NodeNum of the next hop and current relayer" ...

don't "last byte of the NodeNum", use NodeNum modulo 251 (max 1 byte prim number) as filter

May 07 '24 21:05 roha-github

The NodeNum is only 32 bits and based on the MAC address of the device. Assuming that that number is uniformly distributed I don't get why using modulo of the largest prime number is a better approach.

Where does your 0.4% probability refer to?

May 07 '24 21:05 GUVWAF

UINT32 modulo 251 = 0..250 1/251 = 0.4% Using prim numbers for checksum is a common pattern

May 07 '24 21:05 roha-github

I'm sorry but I don't get why that is better than using the last byte, which is 0..255.

It's not a checksum, it's to uniquely identify a node. If the NodeNum is already uniformly distributed, taking the last byte should not have any bias.

May 07 '24 21:05 GUVWAF

About the routing itself ...

Each node could regularly publish a broadcast of its direct neighbors and communicate the reception quality of RSSI and SNR.

The list is sorted by reception quality and only the best 32 nodes are published.

1 byte or 8 bits are used per node: Bit 1 = RSSI quality > -115 dB Bit 2 = SNR quality > -11 dB Bit 3..8 = NodeNum mod 64

Together with the node DB and the GPS coordinates, either the node itself or the app (smartphone, web browser, CLI) can then create a network topology.

Only 7 bytes would be required for a route with 7 hops

May 07 '24 21:05 roha-github

That's what the neighborinfo module does.

May 07 '24 21:05 caveman99

That's what the neighborinfo module does.

Ok, now I understand the benefits of the Neighbor Info Module.

Should the node do the routing itself later? Does the node DB then store all the neighbors of each node? How much memory would this require? I once read somewhere that the node DB would only store up to 80 nodes.

The documentation on the Neighbor Info Module does not really encourage activation. The module would be needed for routing and ever larger mesh networks.

May 07 '24 22:05 roha-github

I think this PR is ready to merge, except that it’s a breaking change unfortunately. I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message. We’ll have to wait for 3.0 with this.

I do not understand the "wait for 3.0", most likely I don't understand what deploying 3.0 means... I would assume that nodes in a mesh will be upgraded slowly, especially big meshes, like SoCalMesh with hundreds of nodes, some in pretty inaccessible locations. This means that whether it's called 3.0 or not the next hop routing will have to co-exist with 2.x. Unless you count on all the 2.X having been upgraded to a version that sets next_hop to zero by then...

I'm wondering whether it would be possible to use the NeighborInfo to gate next-hop routing. If the NeighborInfo can have a flag that says "I do next-hop routing" then on the outgoing side a relayer knows whether populating next-hop makes sense or not.

On the incoming side it's a bit trickier in that a relayer would look up the NeighborInfo based on relay_node. If relay_node matches a NeighborInfo that says it does next-hop routing then the relayer would obey next_hop, although of course the match may be by chance. As a special case, if hop_start is set and equal to hop_limit then IIUC the message is received from the original sender and thus NeighborInfo can be looked up using the from address.

In any case, it seems worth it to figure out some heuristics to make 3.0 interoperable with pre-2.3, which introduced the explicit next_hop field as far as I can tell and not rely on "it's going to take so long for 3.0 to become available that everyone will have upgraded past 2.3 by then"...

Apologies if I'm completely off-base and don't understand some key fact here.

May 19 '24 05:05 tve

Shouldn't the NeighborInfo module be enabled by default?

May 19 '24 05:05 tve

Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible.

I'm not really sure half-working interoperable solutions are the way to go here, especially if it requires the NeighborInfo module to be enabled. It's not that this PR has a huge impact anyway, as it only applies to DMs while I think the majority of packets are still broadcasts.

NeighborInfo creates quite a lot of additional traffic and is not necessary for NextHop routing, so I don't think it should be enabled by default.

May 19 '24 13:05 GUVWAF

Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible.

@GUVWAF I'd like to chime in here and say that I think this part has some major downsides. This threatens to split deployed infrastructure between old and new and drop coverage levels for both old and new until everything, including all client devices, have been changed over. For anyone who has gone further with deployment than just tinkering, this could be a major undertaking just due to the number of devices. And for anyone with infrastructure on mountaintops or other places which are difficult to access, these devices can be a pain in the neck to try to update. The more people using the same settings, the better it is for everyone's coverage, and the experience for new users would be better if they already have an existing network to connect to without having to build it all out themselves. Changing the default modem preset should not be done lightly.

And I agree with @tve's comments about version interoperability. A thoughtful and gradual upgrade path would make it far more likely for people to actually upgrade. Ideally you want people to upgrade to the new version so that you can continue to innovate moving forward. If the changes are too drastic, you risk having people stuck with "good enough, works for me" instead, or even potentially having someone maintain an old compatible version as an LTS release. Or looking at the project/protocol as so unstable that it's not worth continuing to deploy.

I love the idea of this, but a gradual rollout would make it a much easier pill to swallow.

May 19 '24 18:05 KyleMaas

Ofcourse the decision will not be taken lightly and people have been saying it will be happing at least a year from now.

I'm not entirely following why you're advocating for gradual changes, as precisely that leads to "good enough, works for me", instead of doing all breaking changes at once, where you have to upgrade to continue to talk to others that have.

May 19 '24 18:05 GUVWAF

I love the idea of this, but a gradual rollout would make it a much easier pill to swallow.

This pull is not merged because it is a breaking change that needs to wait until 3.0, and there is not enough other stuff to make a breaking change worth it. There is no way to do this gradually. Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated.

May 19 '24 19:05 garthvh

There is no way to do this gradually.

I'm not convinced, that's why I made suggestions above. Yes, there's no way to make the changes without compromises, but I don't know that the compromises have been explored or discussed. Quite possibly just my ignorance.

Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated.

This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare. This kind'a guarantees that no more non-breaking changes can be made going forward which I'd rate as "not good".

At a high level I'm wondering what the plan is for wide-area meshes, which is where the routing improvements have the most impact I believe. I'm part of the southern california mesh with hundreds of nodes spread over hundreds of square km, many mountain ranges dividing the geography, and many nodes in difficult to access/update places. My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool). If the goal is to make this type of mesh usable over longer distances (like sending messages from Santa Barbara to San Diego -- approx 350km) some serious thought needs to go into routing. Maybe the answer is that this should be done via MQTT (internet), maybe the answer is that there should be an overlay backbone network to bridge long distances, maybe the answer lies in just improving the current type of routing. I don't know and it also depends a lot on what the project goals are.

I'm not sure what the best place for this type of discussion is, but I believe it's relevant here because if there's going to be a breaking change then at least make the most out of it. I'm also not trying to piss on this PR: I actually think it's great, I'm just wondering about the bigger picture.

May 19 '24 19:05 tve

This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare.

But this is really beyond the scope of this PR. Currently these two bytes are useless anyway, because before 2.3 they could take any value depending on what was in that memory location. This is just one of the breaking changes currently proposed for 3.0 (fixing text compression is another one), I believe a change to the header to make it future-proof deserves its own PR. Edit: I now see @KyleMaas already opened a discussion for this: https://github.com/meshtastic/firmware/discussions/3932

My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool).

What kind of issues are you having and what makes you believe this has to do with routing? For broadcasts the current managed flooding approach is really not too bad. I'm not convinced there even exists a protocol that's guaranteed to be better in all the different use cases for Meshtastic (especially fixed or mobile nodes, or a mix), that doesn't require an internet connection and can run on microprocessors with limited RAM. For "smarter" routing you need periodic control messages, but those eat up the same bandwidth as regular traffic.

Other problems with wireless mesh networks, e.g. the hidden node problem or unstable/changing RF conditions are not solved with better routing.

May 19 '24 23:05 GUVWAF

@GUVWAF Yeah, starting that discussion to ask about forward compatibility is what led me into joining the discussion on this issue. I have several places in mind to add nodes that would vastly increase coverage in my area, but they are inaccessible enough that once they're up there, I'm probably realistically only going to be able to update the firmware maybe every 12 months or so. So I need to wait for a version of the firmware that has enough interoperability, or have all of the proposed protocol changes interoperable enough with older versions, that I can actually make these nodes available to the community. All of the current nodes I run are in areas which are at least reasonably accessible. But there's no sense in me taking on the time, effort, and expense of building out the expansions if it's going to be completely obsolete in a month - it makes more sense to wait for stability and not recommend Meshtastic to anyone who isn't already in our local coverage zones. So protocol stability and interoperability is extremely important to me.

May 20 '24 01:05 KyleMaas

3.0 is like 18 months away best case. Honestly not really enough there yet to even consider it.

May 20 '24 02:05 garthvh

@GUVWAF "Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast)"

Where are such far-reaching decisions discussed and decided? What exactly is the background? The minimal increase in bandwidth is "bought" with a loss of range and all 2.x routers & repeaters.

Alternatively, 3.x clients switch to Long / Fast or both networks run in parallel and apart.

May 20 '24 06:05 roha-github

Where are such far-reaching decisions discussed and decided?

This was discussed by admins on Discord, but is also just an idea right now. When 3.0 comes closer, there will likely be an RFC for this. Let's now continue this discussion in #3932.

May 20 '24 12:05 GUVWAF

After looking at this again, I realized I made a big assumption that’s likely often not valid. I set the next_hop of a node to the one that was forwarding the ACK/response of a DM for that node. However, with asymmetric links, it might well be that that node is not the best next_hop, and it might even be you can’t reach it, while it can reach you. In my simulator I don’t simulate asymmetric links, but judging from how often people mention they receive “beacons” very well, but can’t send a message to someone, I think this is quite common.

Another thing I didn’t consider is that only the original transmitter will fall back to flooding when it doesn’t hear anyone rebroadcast, but it might also be that intermediate hops loose their next_hop link. In my fork I’ve fixed this by letting the intermediate hops also retransmit once with fall back to flooding, but I’m unsure about the possible overhead this creates.

At least before the first issue is resolved, I’m not confident on merging this yet. Happy to hear any suggestions on this.

Aug 18 '24 13:08 GUVWAF

firmware firmware copied to clipboard

Next Hop-based routing with fallback to managed flooding

Description

Implementation details

Examples

Notes for reviewers

🤖 Pull request artifacts

firmware
firmware copied to clipboard