[Bug]: Traceroutes via ROUTER_LATE node don't end up in the TX queue after modification (to add self hop in rebroadcast) on return path!
Category
Other
Hardware
Linux Native
Firmware Version
2.5.20
Description
When doing a traceroute via a ROUTER_LATE node, traceroute packets are seen leaving and coming back, but the node does not modify the response and put it into the TX queue for rebroadcast. The traceroute therefore never gets returned to the source node, so this is a return-path-only issue.
Captured in the DEBUG log on a node set up as ROUTER_LATE.
The test environment is relatively isolated. If I take the source node for a walk to where it has direct line of sight to another CLIENT node or to the destination node, traceroute works from a RAK4631 in CLIENT_MUTE and it receives the response.
Relevant log output
Thanks @Talie5in - I will try to tackle this tomorrow or Thursday. I need to dig into the traceroute code anyway for #5534, so this is good additional motivation for me to do so!
Have managed to replicate the lack of traceroute response via ROUTER_LATE. Now I just need to figure out why it's happening...
@Talie5in Which exact commit were you using when testing this? I can't find the string "Incoming msg will be filtered, from" from your log in the source code. So I'm not sure where that is coming from, but a bit later it mentions cancelSending id=0x827fe923, removed=1 meaning it removed it from the Tx queue.
It seems to be coming from your modified firmware: https://github.com/Talie5in/mt-device-firmware/blob/02b2ee8883663618a0c6319fc37fe137a6d2ac25/src/mesh/Router.cpp#L574
I believe this is your issue. You're canceling a packet in the Tx queue when another arrives. For ROUTER_LATE this is more likely to happen as it delays the rebroadcast.
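To illustrate the interaction described above, here is a minimal toy model of a relay node with a delayed Tx queue. It is only a sketch and not the firmware's Router or queue code: `ToyNode`, `simulate`, the `cancel_on_duplicate` flag, and all delay values are hypothetical. The packet id is the one from the log, used purely as a label.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Toy relay node; NOT the Meshtastic firmware's Router/queue implementation."""
    rebroadcast_delay_ms: int   # hypothetical delay before a queued packet is sent
    cancel_on_duplicate: bool   # models a build that cancels a queued packet when another copy arrives
    tx_queue: dict = field(default_factory=dict)  # packet id -> scheduled send time (ms)
    sent: list = field(default_factory=list)      # (packet id, send time) actually transmitted

    def flush(self, now_ms: int) -> None:
        """Transmit every queued packet whose scheduled time has passed."""
        due = sorted((t, pid) for pid, t in self.tx_queue.items() if t <= now_ms)
        for t, pid in due:
            self.sent.append((pid, t))
            del self.tx_queue[pid]

    def receive(self, packet_id: int, now_ms: int) -> None:
        """Handle an incoming copy of a packet at time now_ms."""
        self.flush(now_ms)
        if packet_id in self.tx_queue:
            # A copy is still waiting in the queue when another one arrives.
            if self.cancel_on_duplicate:
                del self.tx_queue[packet_id]  # analogous to "cancelSending ..., removed=1"
            return
        if any(pid == packet_id for pid, _ in self.sent):
            return  # already rebroadcast; nothing left to do
        self.tx_queue[packet_id] = now_ms + self.rebroadcast_delay_ms


def simulate(delay_ms: int, cancel_on_duplicate: bool,
             duplicate_at_ms: int = 200, end_ms: int = 5000) -> list:
    """Receive a packet at t=0 and a second copy later; return the ids that were rebroadcast."""
    node = ToyNode(delay_ms, cancel_on_duplicate)
    node.receive(0x827FE923, now_ms=0)
    node.receive(0x827FE923, now_ms=duplicate_at_ms)
    node.flush(end_ms)
    return [pid for pid, _ in node.sent]
```

In this model, `simulate(50, True)` still rebroadcasts because the packet has already left the queue before the second copy arrives, while `simulate(1000, True)` returns an empty list: the longer delay leaves the packet in the queue long enough for the cancel to remove it. With `cancel_on_duplicate=False` the long-delay node rebroadcasts as expected.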
I wonder what it was that I was reproducing then? Because I can get the behaviour to recur here.
@GUVWAF Yup, that appears to be the culprit in those logs. I just got around to retesting this (back on 2.5.20.4c97351) and I do eventually get the TR back (which is valid and in line with ROUTER_LATE). I was switching between that build and the original Meshtastic release while testing things, and didn't realise I wasn't on the right build at the time.
Apologies for the delayed response.
However, I am still getting some that never make it back (I do see them hit the device in the debug logs, but they never make it to the source device). I'm curious whether that's just hitting some kind of "took too long for a result, so I stopped tracking the traceroute" timeout.
@erayd Not sure if you've come across anything further?
If I can find more time in the coming week, I'll tail logs between the ROUTER_LATE on the roof and the node on my desk and see if I can line them up for a submission.
> Not sure if you've come across anything further?
Not yet, but I haven't yet had the opportunity to watch the logs of a ROUTER_LATE in a location where there's no other path back to my test node. Downside of having a mesh with quite good coverage.
It's easy enough to engineer the no-response thing by just going to one of the infill areas. But I can't watch the logs at the same time. Need to find a time to enlist help I think. Get someone else to run the traces while I sit up at the RL site and watch the logs.
Just thinking out loud: I wonder if you could reproduce it at home with a three node test setup. The three nodes on their own frequency slot, tx power turned down, with nodes A and C placed far enough apart to ensure that they hop through B.
@Talie5in have you been able to reproduce this with stock firmware? I'm seeing some issues on 2.5.20 with a client sending a traceroute through a ROUTER_LATE where I don't get the traceroute responses on my client, but the router node updates its nodedb with the traceroute target almost immediately - so I assume it is seeing the response, just not rebroadcasting it to my client.
Unfortunately, my router node is in a place that I can't tail logs from, so it's difficult to diagnose this further.