mtr
Multiple instances of mtr may collide under certain circumstances
Hello.
Yesterday, when configuring telegraf to read the CSV output of multiple mtr instances and store the traceroutes in a DB, I noticed that the collected data was messy: the IPs of different hops in different mtr instances got mixed up. For example, the first hop, which should have been the home router (192.168.1.1), got the IP of an intermediate hop on the route path.
After spending a few hours figuring out what was wrong, I realized that it might be due to mtr itself. I have reproduced the problem on my machines/servers, which are in different network environments and run different OSs.
Steps to reproduce:
- Install mtr v0.92/3 on a machine.
- Choose an IP neither too far from nor too close to the current machine. Generally, 7-15 hops and 20-80 ms RTT would be fine. In my demo below, I use `www.japan.co.jp`.
- Choose an IP that is far enough from the current machine that the route path is not fixed for TCP traffic. Generally, >150 ms RTT would be fine. In my demo below, I use `www.gov.de`.
- `mtr` the first IP. Before exiting, remember its route path, which is expected to be fixed.
- `mtr -T` or `mtr -u` the second IP. Before exiting, be sure that its route path is dynamic in the sense that there can be multiple IPs within one hop.
- `mtr` the first IP and `mtr -T/u` the second IP simultaneously.
- Result: the `mtr` output for the first IP is messed up.
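For anyone who wants to script the last two steps instead of juggling two terminals, here is a rough Python sketch. The helper names `mtr_cmd` and `run_together` are made up for illustration, and using report mode (`--report`, `--report-cycles`, `--no-dns`) in place of the interactive display is my own assumption about how best to capture the output. It is not part of the original reproduction.

```python
import subprocess

def mtr_cmd(target, cycles=10, mode=None):
    """Build an mtr command line in report mode (hypothetical helper)."""
    cmd = ["mtr", "--report", "--no-dns", "--report-cycles", str(cycles)]
    if mode == "tcp":
        cmd.append("--tcp")   # same as -T
    elif mode == "udp":
        cmd.append("--udp")   # same as -u
    cmd.append(target)
    return cmd

def run_together(first, second, mode="tcp", cycles=10):
    """Start both mtr processes before waiting on either, so they overlap."""
    procs = [
        subprocess.Popen(mtr_cmd(first, cycles),
                         stdout=subprocess.PIPE, text=True),
        subprocess.Popen(mtr_cmd(second, cycles, mode),
                         stdout=subprocess.PIPE, text=True),
    ]
    # Collect each report once its process has finished.
    return [p.communicate()[0] for p in procs]
```

Calling `run_together("www.japan.co.jp", "www.gov.de")` and comparing the first report against a standalone run of the first target should show whether the hops get mixed up on your machine.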
Demo (terminal recording) on a fresh Vultr instance: https://asciinema.org/a/txzVrlO9LNGy2RVrk1BKCBiDn
In case it helps, here is my git bisect result:
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
032d82f326d76ef1e064b5db68a2486275fa06b5
22b7454a2faed8f37b608cfff320e567c5f96ab6
ac58c7a4b744752958975b93d8f774572186a421
5d26cb0c0500b85f71a43194f090bb97064f71cf
6df4e45df4ae1504604d5eef1b0858e1cb6e42de
88d1a95087185339e439918a24923d5e0e816451
We cannot bisect more!
Is anything being done about this issue? It seems to be happening a lot.
Thanks
Just checking in to see if there have been any updates on this issue. Has anyone gotten any resolution on this?
Thanks
I just encountered this issue too.
Is it fixable?
The issue is that the internet protocols were not designed with the "mtr" use in mind. So the "error messages" that the intermediate routers between your host and your target send back when mtr is doing its thing are more of the sort: "Hey
There is just ONE piece of identifying information that the routers are required to return with the error message. That's the source port number. So mtr uses that to determine which sent packet the router's error response was generated for.
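To make the collision concrete, here is a toy Python model of that matching step. It is not mtr's actual code: the `Prober` class, the port number 33000, and the 203.0.113.7 address are all made up, and it assumes that both mtr processes get to see the same router error message.

```python
from dataclasses import dataclass, field

@dataclass
class Prober:
    """Toy stand-in for one mtr instance (hypothetical, not mtr's code)."""
    name: str
    outstanding: dict = field(default_factory=dict)  # source port -> TTL probed

    def send(self, port, ttl):
        # Remember which hop (TTL) this source port was used for.
        self.outstanding[port] = ttl

    def on_error(self, port, router_ip):
        """Return (ttl, router_ip) if this instance thinks the reply is its own."""
        ttl = self.outstanding.get(port)
        return None if ttl is None else (ttl, router_ip)

a, b = Prober("A"), Prober("B")
a.send(port=33000, ttl=1)   # A probes its first hop
b.send(port=33000, ttl=7)   # B happens to pick the same port for hop 7

# The router at hop 7 on B's path answers; assume both processes see it.
reply = (33000, "203.0.113.7")
a_view = a.on_error(*reply)  # A wrongly files the hop-7 router under hop 1
b_view = b.on_error(*reply)  # B files it correctly under hop 7

# With non-overlapping ports, A ignores B's replies.
c = Prober("C")
c.send(port=34000, ttl=1)
c_view = c.on_error(*reply)  # None: port 33000 is not one of C's
```

In this model, instance A ends up showing a far-away router as its first hop, which is the kind of mix-up described in the report, while instance C, using a different port, is unaffected.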
So when mtr is running, a significant number of ports is "allocated" to mtr. Say we need to probe a target 30 hops away, say we don't want to reuse ports for 30 seconds... then we're already using 900 ports. There are only about 65000, so according to this calculation we're already using a sizeable chunk.
It is possible that mtr will use a legitimate port number that another app is actively using. Then the system will receive a "host unreachable" from an mtr probe and might then close that other program's connection. We therefore need to use as few ports as possible.
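The back-of-the-envelope number above can be written out as follows. The one-probe-per-hop-per-second rate is my assumption (it is what makes 30 x 30 come out to 900), and the variable names are only illustrative.

```python
MAX_HOPS = 30              # hops probed, from the example above
REUSE_DELAY_S = 30         # seconds before a port may be reused
PROBES_PER_HOP_PER_S = 1   # assumption: one probe per hop per second
TOTAL_PORTS = 65536        # size of the 16-bit port space

# Ports tied up by a single running instance.
ports_in_use = MAX_HOPS * REUSE_DELAY_S * PROBES_PER_HOP_PER_S

# Fraction of the whole port space that one instance occupies.
share = ports_in_use / TOTAL_PORTS

# How many instances fit if each needs its own non-overlapping range.
max_disjoint_instances = TOTAL_PORTS // ports_in_use

print(ports_in_use, round(share, 4), max_disjoint_instances)
```

Under these assumptions every additional instance needs another 900 ports, so giving each one its own range only works for a limited number of instances, and only if nothing else on the machine is using those ports.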
(back in the late eighties there was a network card that had a firmware bug: the OS would configure it to only capture packets "for me" and "broadcast". The OS then assumed that all packets would indeed be "for me" or "broadcast". So when the card saw a packet "FROM hostA to hostB for port X" that packet would sometimes be passed on to the OS. The OS would then respond: "From hostC to hostA: No connection to port X exists at this computer!", and HostA would close the connection. So a third host sending error packets can be VERY annoying when it breaks up a legit connection!)
I think you can specify the range to use on the command line. Specifying different ranges will help.