Multiple instances of mtr may collide under certain circumstances

Open Gowee opened this issue 4 years ago • 4 comments

Hello.

Yesterday, while configuring telegraf to read the CSV output of multiple mtr instances and store the traceroute results in a DB, I noticed that the collected data was messy: the IPs of different hops in different mtr instances got mixed up. For example, the first hop, which should have been the home router (192.168.1.1), got the IP of an intermediate hop on the route path.

After spending a few hours figuring out what was wrong, I realized that it might be due to mtr itself. I have reproduced the problem on several of my machines/servers, which run different OSs in different network environments.

Steps to reproduce:

  1. Install mtr v0.92 or v0.93 on a machine.
  2. Choose a target that is neither too far from nor too close to the current machine. Generally, 7-15 hops and 20-80ms RTT would be fine. In my demo below, I use www.japan.co.jp.
  3. Choose a target that is far enough from the current machine that the route path is not fixed for TCP traffic. Generally, >150ms RTT would be fine. In my demo below, I use www.gov.de.
  4. Run mtr against the first target. Before exiting, note its route path, which is expected to be fixed.
  5. Run mtr -T or mtr -u against the second target. Before exiting, make sure that its route path is dynamic, in the sense that multiple IPs can show up within one hop.
  6. Run mtr against the first target and mtr -T or mtr -u against the second target simultaneously.
  7. Result: the mtr output for the first target is messed up.
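
Steps 4 to 6 can also be scripted. The following is a minimal Python sketch, assuming `mtr` is on `PATH` and using the two hostnames from the steps above. The helper names `build_commands` and `run_concurrently`, the cycle count, and the choice of `--csv` output with `-n` are illustrative assumptions, not part of the original report.

```python
import subprocess


def build_commands(fixed_target, dynamic_target, cycles=10, mode="-T"):
    """Return the argv lists for the two mtr runs from step 6.

    mode is "-T" (TCP) or "-u" (UDP) and applies to the second run only;
    the first run uses mtr's default ICMP probes.
    """
    if mode not in ("-T", "-u"):
        raise ValueError("mode must be '-T' or '-u'")
    # --csv: CSV output, -n: no DNS lookups, -c: number of report cycles
    common = ["mtr", "--csv", "-n", "-c", str(cycles)]
    return [common + [fixed_target], common + [mode, dynamic_target]]


def run_concurrently(commands):
    """Start all commands at once and return their stdout, in order."""
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
             for cmd in commands]
    return [proc.communicate()[0] for proc in procs]


# Example (needs network access and sufficient privileges for mtr):
# icmp_csv, tcp_csv = run_concurrently(
#     build_commands("www.japan.co.jp", "www.gov.de"))
```

Comparing `icmp_csv` from a concurrent run against the output of the first command run alone should show whether hop IPs from the second trace leak into the first.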

Demo (terminal recording) on a fresh Vultr instance: https://asciinema.org/a/txzVrlO9LNGy2RVrk1BKCBiDn

In case it helps, here is my git bisect result:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
032d82f326d76ef1e064b5db68a2486275fa06b5
22b7454a2faed8f37b608cfff320e567c5f96ab6
ac58c7a4b744752958975b93d8f774572186a421
5d26cb0c0500b85f71a43194f090bb97064f71cf
6df4e45df4ae1504604d5eef1b0858e1cb6e42de
88d1a95087185339e439918a24923d5e0e816451
We cannot bisect more!

Gowee avatar May 01 '20 12:05 Gowee

Is anything being done about this issue? It seems to be happening a lot.

Thanks

ttrading avatar May 17 '20 20:05 ttrading

Just checking in to see if there have been any updates on this issue. Has anyone found a resolution?

Thanks

ttrading avatar Aug 13 '20 18:08 ttrading

I just encountered this issue too.

Is it fixable?

1f604 avatar Mar 23 '24 20:03 1f604

The issue is that the internet protocols were not designed with the "mtr" use case in mind. The "error messages" that the intermediate routers between your host and your target send back while mtr is doing its thing are more of the sort: "Hey, it seems you can't get there from where you are! Sorry!". There is no "and by the way, the packet you were sending is hereby returned". The original packet is lost. So when we've been running for a while and we get back such an error message, we have to relate it to the packet we sent (how many hops it was meant to travel) and how long ago we sent it. (It is possible that the response took several seconds, during which more packets were sent that are supposed to be returned by that same intermediate router!)

There is just ONE piece of identifying information that the routers are required to return with the error message: the source port number. So mtr uses that to determine which sent packet the router's error response refers to.
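
To make the matching problem concrete, here is a toy Python model of it. It is not mtr's actual code. It assumes two things that the thread does not confirm: that every running instance gets to see every incoming router error, and that two instances can have a probe outstanding on the same port number at the same time. The names `ProbeTable` and `deliver` are made up for this sketch.

```python
class ProbeTable:
    """Toy model of one instance's outstanding probes, keyed by source port."""

    def __init__(self, name):
        self.name = name
        self.outstanding = {}  # source port -> hop (TTL) the probe was sent with

    def send(self, port, hop):
        self.outstanding[port] = hop

    def match(self, port):
        """Return the hop this instance credits for an error quoting `port`."""
        return self.outstanding.pop(port, None)


def deliver(error_port, router_ip, instances):
    """Hand one router error to every instance (assumption: all of them see it)."""
    return {inst.name: (inst.match(error_port), router_ip)
            for inst in instances
            if error_port in inst.outstanding}


first = ProbeTable("first")
second = ProbeTable("second")
first.send(33000, hop=1)    # probe towards the home router
second.send(33000, hop=7)   # unrelated probe, same port number

# One error, triggered by the second instance's probe, arrives from hop 7.
claims = deliver(33000, "203.0.113.7", [first, second])
```

In this model `claims` contains an entry for both instances, so the first instance records the hop-7 router's address against its hop 1, which is the kind of mix-up described in the original report.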

So while mtr is running, a significant number of ports is effectively "allocated" to mtr. Say we need to probe 30 hops away and we don't want to reuse a port for 30 seconds: at one probe per hop per second, we're already using 900 ports. There are only about 65000, so by this calculation we're already using a sizeable chunk.
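
The arithmetic behind the 900 figure, as a small Python sketch. The rate of one probe per hop per second is an assumption made here to reproduce that number, and `ports_in_use` is a made-up helper name.

```python
PORT_SPACE = 65536  # port numbers are 16 bits wide


def ports_in_use(max_hops, reuse_delay_s, probes_per_hop_per_s=1):
    """Ports held back at any moment if each probe gets its own port."""
    return max_hops * reuse_delay_s * probes_per_hop_per_s


one_instance = ports_in_use(max_hops=30, reuse_delay_s=30)   # 900
five_instances = 5 * one_instance                            # 4500
share = five_instances / PORT_SPACE
```

The budget scales linearly with the number of concurrent instances, so running several traces at once multiplies both the port usage and the chance that two of them pick the same number.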

It is possible that mtr will use a legitimate port number that another app is actively using. The system will then receive a "host unreachable" triggered by an mtr probe and might close that other program's connection. We therefore need to use as few ports as possible.

(back in the late eighties there was a network card that had a firmware bug: the OS would configure it to only capture packets "for me" and "broadcast". The OS then assumed that all packets would indeed be "for me" or "broadcast". So when the card saw a packet "FROM hostA to hostB for port X" that packet would sometimes be passed on to the OS. The OS would then respond: "From hostC to hostA: No connection to port X exists at this computer!", and HostA would close the connection. So a third host sending error packets can be VERY annoying when it breaks up a legit connection!)

I think you can specify the range to use on the command line. Specifying different ranges will help.

rewolff avatar Mar 25 '24 11:03 rewolff