
instability on Linux machines

Open PatrykW-95 opened this issue 3 years ago • 9 comments

Hi, after many tests on different platforms, kernels, networks, etc., I'm stuck with the same weird issue every time. My setup:

LAN 1 (office):
A) Windows 11 (ZeroTier One 1.8.6)
B) Debian 11 (ZeroTier One 1.8.6, built with the debug option), running as a VM on Windows 11 with a bridged Ethernet adapter

LAN 2 (home):
C) Windows 11 (ZeroTier One 1.8.6)
D) Debian 11 (ZeroTier One 1.8.6, built with the debug option), running as a VM on Windows 11 with a bridged Ethernet adapter

LAN 1 and LAN 2 are two separate networks, kilometers apart.

All machines have netcat (on Windows 11 I'm using Ncat from Nmap, as suggested here).

If I start an nc listener on A and connect to it from C (or vice versa), everything works perfectly with relatively low latency. If I start the nc listener on B and connect from C or D, I'm only sometimes able to send a packet and have it received. The same behavior occurs when the listeners are in "LAN 2 home" and the ZeroTier nodes in the office act as clients.

It seems that the Windows clients work well in both directions, whilst the Linux clients have problems receiving packets.

On Debian I built it from the master branch and ran it from the command line. In the output I found that the Debian machines are continuously requesting the network configuration (about every 10 seconds), and the moment I try to connect with nc, this message appears: MAC failed for packet 97127342bb532374 from 38b02ee846(xxx.xxx.xxx.xxx/9993). I'm using the controller at my.zerotier.com, where all ZeroTier nodes are authorized and online. I've also tried older versions: 1.4.4, 1.4.6, 1.8.4, and 1.8.5.

PatrykW-95 avatar Mar 14 '22 10:03 PatrykW-95

Thanks for reporting this.

Have you tried with just Linux <-> Linux? In these tests I see Windows as a potential variable to control for.

> I've also tried older versions like 1.4.4, 1.4.6, 1.8.4, 1.8.5.

So you're saying this problem exists with all of these versions?

joseph-henry avatar Mar 15 '22 16:03 joseph-henry

I've tried it and it seems to work, but with netcat only about 3 out of every 10 packets I send arrive correctly. I'd like to add that the behaviour is the same in every version I tested, and that I've also disabled the Windows firewall.

PatrykW-95 avatar Mar 16 '22 08:03 PatrykW-95

Can you check in the output of zerotier-cli peers that you have a DIRECT path between your nodes of interest? This sounds like you might be relaying.
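As a rough aid for that check, the sketch below filters the peer listing for entries that are being relayed. It assumes that `zerotier-cli peers` prints one peer per line, with the peer address in the first column and the link type (`DIRECT` or `RELAY`) as a separate word somewhere on the line; `relayed_peers` is a hypothetical helper name, not part of ZeroTier.

```shell
# relayed_peers: read `zerotier-cli peers` output on stdin and print the
# first column (assumed to be the peer address) of every line that
# contains the standalone word RELAY.
relayed_peers() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "RELAY") { print $1; break }
        }
    }'
}

# Usage (needs a running zerotier-one service):
#   zerotier-cli peers | relayed_peers
```

If this prints nothing, no peer in the listing was reported as relayed at the moment the command ran.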

joseph-henry avatar Mar 17 '22 16:03 joseph-henry

I had a similar problem. I first changed the MTU, which helped, but ended up moving away from ZT1 for that project.

It did work fine for about a year, but broke sometime in 2021, I think.

qt1 avatar Mar 17 '22 20:03 qt1

Hi @joseph-henry. All the peers, including planets and leafs, are DIRECT. I've also tried changing the MTU (2800, 1500, 100, 68) in both VMs as suggested by @qt1, and nothing changed. Is there something else I can try to resolve the problem?

PatrykW-95 avatar Mar 21 '22 08:03 PatrykW-95

could this be related to https://github.com/zerotier/ZeroTierOne/issues/1422?

rhetr avatar Mar 24 '22 15:03 rhetr

> could this be related to #1422?

Maybe.

Just a shot in the dark, but we've recently fixed a bug that could cause some packets to fail their MAC (validation). This fix is available in dev and I'd be curious if it helps at all.

joseph-henry avatar Apr 20 '22 16:04 joseph-henry

I have found that version 1.8.x of Zerotier is not stable on my ac86u and ax86u routers, whereas version 1.4.x does not have these issues. I have set the MTU to 1388, as Zerotier 1.4.x on the router requires this to function normally.

I have encountered the following problems:

  1. Sometimes pings (ICMP) are successful from all peers and both sides, but TCP connections wait for data forever. zerotier-cli info shows ONLINE, and listpeers is all OK. To fix this, I have to restart zerotier-one.
  2. Sometimes zerotier-cli info goes OFFLINE and will not come back up unless zerotier-one is restarted manually. The zerotier-one -d process shows as active, but the status remains OFFLINE. This problem has been seen since version 1.6.x, but it has become more frequent and significantly worse with version 1.8.x.

I have also found that versions 1.6.x/1.8.x on Windows 10 have performance issues when accessing LAN IPs. For example, when accessing my Emby server or RDP on my NAS (which has the IP address 192.168.9.7) from my Zerotier network (which has the IP address range 10.9.8.0/24), the connection becomes slower and slower until I have to disconnect Zerotier, at which point the network speed immediately returns to normal.

My network topology is as follows: WIN10 (10.9.8.8|192.168.9.8) <---> Router ax86u (10.9.8.4|192.168.9.1) <---> NAS_WIN10 (192.168.9.7)

My kernel is Linux RT-AX86U 4.1.52 #2 SMP PREEMPT Fri Mar 25 11:09:29 EDT 2022 aarch64 ASUSWRT-Merlin.

I have to rely on a one-minute crontab script to prevent Zerotier 1.8.x from failing, which is really annoying. I have not come up with a better idea for detecting TCP connection failures.

```bash
# -----------------------------------------
# Function : initCheck
# Args     : none
# Input    : none
# Return   : none
# Note     : sysLOG and baseRoute are helper functions not shown in this snippet.
function initCheck() {

    # Restart the service if zerotier-cli does not report ONLINE
    ZT_ONLINE=$(zerotier-cli info | grep -i "online")
    if [ -z "$ZT_ONLINE" ]; then
        sysLOG "ZT OFFLINE! restarting" warning
        /opt/etc/init.d/S91zerotier-one restart
        return 0
    fi

    # Find the first ZeroTier interface (name starts with "zt")
    ZT_INTERFACE=$(ip -o link show | grep -oP '\d{1,2}:\s\Kzt[\w]+' | head -n1)
    # Fallback for systems where grep -P is unavailable or did not match
    if [ -z "$ZT_INTERFACE" ]; then
        echo "get zt interface empty, try another way"
        ZT_INTERFACE=$(ip -o link show | awk -F': ' '{print $2}' | grep "^zt" | head -n1)
    fi

    # Sometimes the zt device disappears until zerotier is restarted
    if [ -z "$ZT_INTERFACE" ]; then
        sysLOG "zt+ dev disappeared! Restarting" warning
        /opt/etc/init.d/S91zerotier-one restart
    fi

    # MTU is causing lots of problems
    if [ -n "$ZT_INTERFACE" ]; then ifconfig "$ZT_INTERFACE" mtu 1388; fi

    # Add base route tables
    if [ -n "$ZT_INTERFACE" ]; then baseRoute; fi
}
```
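One possible direction for the TCP check mentioned above is a handshake probe with a timeout. This is only a sketch under several assumptions: it needs bash built with `/dev/tcp` support and the coreutils `timeout` command (a router's BusyBox environment may lack either), and `tcp_probe` is a hypothetical helper name. It only tells you whether a TCP connection can be opened; it would not catch a connection that opens but then stalls waiting for data.

```shell
# tcp_probe HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT can be opened within the
# timeout (default 3 seconds), non-zero otherwise.
tcp_probe() {
    local host=$1 port=$2 secs=${3:-3}
    # Inside the child shell, $0 is the host and $1 is the port.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage inside a watchdog (address and port are examples only):
#   if ! tcp_probe 10.9.8.8 3389; then
#       /opt/etc/init.d/S91zerotier-one restart
#   fi
```

The probe target should be a peer on the ZeroTier network that is known to listen on the chosen port, otherwise a refused connection would be mistaken for a ZeroTier failure.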

myfingerhurt avatar Apr 26 '22 02:04 myfingerhurt

> Just a shot in the dark, but we've recently fixed a bug that could cause some packets to fail their MAC (validation). This fix is available in dev and I'd be curious if it helps at all.

doesn't seem to help me

rhetr avatar Apr 26 '22 23:04 rhetr

Is anyone in this ticket still having issues? Since the creation of this ticket there have been two MTU-related fixes:

  • #1860 (will be in next release) lets you set the ethernet tap's overlay network MTU (default 2800)
  • #1844 (will be in next release) lets you set the Physical MTU on a per-link basis using multipath. It will likely be generalized to work without multipath but you can try things like:
```json
{
  "settings": {
    "defaultBondingPolicy": "ab",
    "policies": {
      "ab": {
        "basePolicy": "active-backup",
        "failoverInterval": 30000,
        "links": {
          "eth0": { "mtu": 1400 }
        }
      }
    }
  }
}
```
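To get a feel for which physical MTU value is worth trying in a config like the one above, one option is to send a single non-fragmentable ping sized to fill a candidate MTU exactly. The sketch below assumes Linux with iputils `ping` (for the `-M do` option) and IPv4; both function names are hypothetical helpers, and the host you probe is up to you.

```shell
# icmp_payload_for_mtu MTU
# Largest ICMP echo payload that fits in one IPv4 packet of size MTU:
# subtract the 20-byte IPv4 header and the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# probe_mtu HOST MTU
# Sends one ping with the "don't fragment" bit set. Returns 0 if a
# packet of exactly MTU bytes got a reply, non-zero otherwise.
probe_mtu() {
    ping -c 1 -W 2 -M do -s "$(icmp_payload_for_mtu "$2")" "$1" >/dev/null 2>&1
}

# Usage (run against a host reached over the physical link, not over
# the ZeroTier interface):
#   probe_mtu <some-host> 1400 && echo "1400 fits"
```

A failed probe can also mean the host drops ICMP, so it is worth confirming with a small size first.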

Hopefully something can be of use. If you have further questions please let me know.

joseph-henry avatar Jan 25 '23 21:01 joseph-henry