ZeroTierOne
1.10.1 drops connectivity completely
Before filing a Bug Report
If you still want to file a Bug Report
Please let us know
- What you expect to be happening.
stable point-to-point connectivity within the same LAN, both across an "n-1" ZT version gap while upgrading and within the same version, with good throughput.
- What is actually happening?
complete failure to transfer data over ZT networks that have been stable for months on ZT 1.8.9 and earlier, when the sender is on 1.10.1 and the receiver is still on ZT 1.8.9.
- Any steps to reproduce the error.
    - run FreeBSD 13.1-RELEASE (amd64 or arm64)
    - build & run ZT 1.8.9 on one node, ZT 1.10.1 on another
    - run `iperf3 --server` on the old version, `iperf3 --client <serverIP> --get-server-output --time 120` on the new version
    - note that no traffic makes it through
    - run tcpdump on the underlying interface with the appropriate port and see similar traffic on the physical wire (pcap available privately on request)
    - note that `ping -b 10240 -s 9000 fc...:1` works
    - note that after switching sender/receiver roles, everything works fine
- Any relevant console output or screenshots.
### a01 (ZT1.8.9)
```
iperf3 --server
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from fc...::1, port ...
...
[ 36] 118.00-119.02 sec 0.00 Bytes 0.00 bits/sec
[SUM] 118.00-119.02 sec 0.00 Bytes 0.00 bits/sec
iperf3: error - idle timeout for receiving data
```
### w01 (ZT1.10.1)
```
...
iperf3 --parallel 16 --client a01 --zerocopy --get-server-output --time 120
Connecting to host a01, port ...
...
[ ID] Interval         Transfer     Bitrate         Retr  Cwnd
[  5] 0.00-1.00   sec  1.25 MBytes  10.5 Mbits/sec    2   2.66 KBytes
[  7] 0.00-1.00   sec  1.25 MBytes  10.5 Mbits/sec    2   2.66 KBytes
...
[SUM] 0.00-120.01 sec  20.0 MBytes  1.40 Mbits/sec  146            sender
[SUM] 0.00-120.01 sec  0.00 Bytes   0.00 bits/sec                  receiver
iperf3: error - control socket has closed unexpectedly
```
With the receiver running on ZT 1.10.1, it's all good sending from ZT 1.8.9 (in both directions):
### i01 (ZT1.8.9)

```
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval         Transfer     Bitrate         Retr
[  5] 0.00-10.00  sec  55.8 MBytes  46.8 Mbits/sec   73            sender
[  5] 0.00-10.03  sec  54.0 MBytes  45.2 Mbits/sec                 receiver
```
Is this some weird asymmetric routing? AFAICT the IPv6 route table looks Just Fine on both ends:
```
   route to: fc...::11
destination: fc::
       mask: ffff:ffff:ff00::
        fib: 0
  interface: ztagim5o45dhe4c
      flags: <UP,DONE,PINNED>
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu     weight  expire
        0         0         0         0    2800         1       0
```
- What operating system and zt version?
Various FreeBSD 13.1 & 14.0-CURRENT systems, both aarch64 and amd64. Also OPNsense.
cc @darkain anything to add?
Looking at `zerotier-cli peers` on all nodes shows that everybody does have direct connections, to the correct IP/port.
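For anyone doing the same check across many nodes, here's a rough sketch (my own, not an official tool) that summarizes active direct paths from the JSON printed by `zerotier-cli peers -j`; the field names (`paths`, `active`, `address`) are assumptions based on observed output, not a documented schema:

```python
import json

def direct_paths(peers_json: str) -> dict:
    # Map each peer's ZT address to its list of active physical paths.
    # An empty list suggests the peer is relayed rather than direct.
    return {
        peer["address"]: [p["address"] for p in peer.get("paths", []) if p.get("active")]
        for peer in json.loads(peers_json)
    }

# Hand-made sample mimicking the assumed `zerotier-cli peers -j` shape.
sample = json.dumps([
    {"address": "abcdef0123",
     "paths": [{"address": "198.51.100.7/9993", "active": True},
               {"address": "10.0.0.7/9993", "active": False}]},
    {"address": "9876543210", "paths": []},  # no active path: relayed
])
print(direct_paths(sample))
# {'abcdef0123': ['198.51.100.7/9993'], '9876543210': []}
```

On a real node you'd feed it the actual command output instead of the sample.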
Hello. That's strange. I had two Debian 11 machines already set up and couldn't reproduce there. Will try on FreeBSD...
Did something get stuck during upgrade/restart, in pf or something? Does it happen on a different zerotier network?
I can 100% confirm this behavior as well, and have been documenting notes over at: https://twitter.com/DarkainMX/status/1546544760336179200
Something is definitely wrong with 1.10.x at least on FreeBSD. It'll do the initial connection, and work for a -random-? amount of time, and then just die uncontrollably, where one of the two sides needs to be restarted to regain connection.
Connections break for me in a matter of only a few minutes with 1.10.1, but work endlessly on any 1.8.x version.
In my particular case, I'm manually building the 1.10.1 package, since we don't have one yet for FreeBSD, using the existing Port and bumping the version (what I normally do when testing out ZeroTier on various hardware platforms before we push to the main Ports tree). Afterwards, I just do a `pkg add` with my manually created port, which overwrites the binaries in place and doesn't touch configs at all, and then I do a `service zerotier restart`.
Downgrading from 1.10.1 back to 1.8.9 with the same process (replacing the package in place once again and restarting the service), things were instantly stable again.
This may be related to a bug we found in central earlier today. Gut feeling, no real evidence.
Just wanted to communicate that. It should be resolved later this week.
Do you get the same behavior with self-hosted controllers?
@erikh I'm 99% sure the issue you're talking about is completely unrelated. The issue Travis found is a display issue that only affects "public" networks
my b
I couldn't get it to happen yesterday on the FreeBSD 13.0 VMs I had. They are on the same physical LAN though. Do I need to try across the internet or through NAT? I built by typing `gmake` in the zerotierone repo. I had to `gmake ZT_SSO_ENABLED=0` for 1.8.9.
Can someone tell me how to build different versions from git with the freebsd pkg system, if there are patches in there that would make a difference? I'm not familiar.
Install FreeBSD with the "Ports" option selected during install. This will create the folder `/usr/ports` with tons of stuff inside, including `/usr/ports/net/zerotier`. Enter that folder, and type `make` to build the version included. You can also check the `files` subfolder to see what patches are applied.
To change which version of the code is compiled, edit the Makefile in ZT's ports folder. The version number is right at the top. After that, run `make makesum` to update the checksum info, then `make` to build and `make install` to install it. This particular port is designed to pull the code based on the tags here on GitHub, so any tag in zerotier/ZeroTierOne should work.
If ZeroTier is already installed, use `make reinstall` instead and it'll figure out what it needs to do (uninstall the old package and install the new one).
Also noted in my Twitter thread: I did NOT experience these issues in my limited testing with a VM on an Intel-based Xeon server. HOWEVER, this issue happened on ALL of my routers that are AMD-based and run bare metal. I've yet to test my ARM systems or other configurations, or pump more data through my Intel VM to see if it would eventually break (normally no traffic hits that VM, so it may be due to the amount of traffic, not the platform).
Awesome. I should put that in the readme...
I wonder if @dch is on AMD?
looks like the ext/json got moved to ext/nlohmann, so that license line needs to be updated in the makefile
@laduke the full diff for ports is https://reviews.freebsd.org/D35770 | https://reviews.freebsd.org/file/data/65thfz333ymhsuhh2iqm/PHID-FILE-ecnkatlrgpsuerqlnz3e/D35770.diff - it needs a little more tweaking than just changing the version, as you found.
This package is built for 13.1-RELEASE (which is the current supported version of the 13.x line). It may work on 13.0-RELEASE, I can't say. If you're stuck, look for dch on IRC in the usual places.
```
$ sudo pkg add https://pkg.skunkwerks.at/FreeBSD:13:amd64/All/zerotier-1.10.1.pkg
```

For the other box, you should be able to simply:

```
$ sudo pkg install -r FreeBSD net/zerotier benchmarks/iperf3
```

In both cases, the easiest is to reboot the VM to make sure it's OK.
TLDR:
- I see no issues with virtualised systems
- 1.8.9 <-> 1.8.9 works and 1.10.1 <-> 1.10.1 works fine, as does a 1.10.1 server with a 1.8.9 client connecting; only the reverse fails, and only on bare metal
- it fails only on bare metal when running ZT 1.8.9 and connecting a ZT 1.10.1 client, e.g. `iperf3 --server` running with ZT 1.8.9 and a ZT 1.10.1 client connecting via `iperf3 --parallel 16 --client <zt-ip> --zerocopy --get-server-output --time 120`
In my local test system, it's 1G & 10G NICs, same subnet, one switch, no firewalls in between. Tried both amd64 (Intel Xeon & similar) and arm64 with the same results. Traffic only works one way!
@darkain did you get any results when switching sender & receiver (e.g. via iperf)? We should also try on bare metal, with all tunables disabled (this isn't feasible for me this week though).
| node | version | zt | cpu |
|---|---|---|---|
| c01 | 13.1 | 1.8.9 | atom c2750 |
| w01 | current | 1.10.1 | xeon e5-2667 |
| i09 | 13.1 | 1.8.9 | xeon e3-1275 |
| s01 | current | 1.10.1 | arm64 32core |
| a01 | current | 1.10.1 | arm64 4core vm |
The notable difference is that I only use 6PLANE in ZeroTier, no IPv4 at all (so cool), and none of these are VMs.
I can't repro on VMs either. Upgraded to 13.1 and tried your packages too (thanks!). I don't really have access to hardware to test bare metal on at the moment, and don't think I'd be much help at this point anyway.
I have many OPNsense (FreeBSD-based) firewalls where I use ZeroTier. This is my experience: ZeroTier 1.8.6 or below works in every scenario; ZeroTier 1.8.9 or 1.10.1 breaks everything, no matter whether I'm on FreeBSD 13.0 or 13.1. All these firewalls are VMware VMs using the vmxnet adapter. When I say break, it literally breaks the whole network! It generates an enormous amount of bogus traffic, and eventually the state table and mbuf usage are exhausted. When the bogus traffic is generated, every member of the network gets those packets, around a 100 Mbit/sec flow, which zeroes out all legitimate communication on the network! Even stranger, this only happens after the 4th FreeBSD node is upgraded to 1.8.9 or 1.10.1, and only when the FreeBSD nodes are connected to the same network.
I use this on all nodes:

```json
{
  "physical": {
    "10.0.0.0/8": { "blacklist": true },
    "172.16.0.0/12": { "blacklist": true },
    "192.168.0.0/16": { "blacklist": true }
  }
}
```
This should prevent this behavior, and it works as intended in version 1.8.6 or below, but is broken in every newer version! So if you use multiple nodes in multiple cross-routing scenarios, stay away from any newer version! I did report this to ZeroTier, because we are a paying user, but they don't know the reason yet. It seems it's still broken.
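As a sanity check of what those blacklist rules actually cover, here's a small sketch (mine, not ZeroTier's) that tests a candidate physical address against the networks in that local.conf using Python's `ipaddress` module; it only models the `"blacklist": true` semantics described above:

```python
import ipaddress
import json

# The "physical" blacklist from the local.conf above (the RFC 1918 ranges).
local_conf = json.loads("""
{ "physical": {
    "10.0.0.0/8":     { "blacklist": true },
    "172.16.0.0/12":  { "blacklist": true },
    "192.168.0.0/16": { "blacklist": true } } }
""")

def is_blacklisted(ip: str) -> bool:
    # True if the address falls within any network marked "blacklist": true.
    addr = ipaddress.ip_address(ip)
    return any(
        rules.get("blacklist") and addr in ipaddress.ip_network(net)
        for net, rules in local_conf["physical"].items()
    )

print(is_blacklisted("192.168.1.5"))   # True: private LAN address
print(is_blacklisted("203.0.113.7"))   # False: public address
```

With those three entries every RFC 1918 address is rejected as a physical path, which is why ZeroTier should only build paths over public addresses.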
@vadonka I believe you've run into a separate issue then, not this particular bug. The bug described in this ticket only affects 1.10.x; reverting to 1.8.9 solves the issue. If you're having issues with 1.8.9, then that's most likely something else.
@darkain I bumped all my dev & prod systems to 1.10.1 now (arm64 + amd64 both vm and physical), not seeing any issues at all. Maybe you want to take the plunge?
@glimberg is it possible this issue stems from something in the roots? I can't see why this issue would fail previously, and work now.
I don't have a 1.8.x around to test against, but I'll try this in a couple of weeks at EuroBSDcon.
I noticed something which may cause some of the FreeBSD (or any) connectivity issues. We used a WAN interface with multiple virtual IP addresses. It's not CARP, just IP aliases, but ZeroTier starts listening on all virtual addresses, not just on the main interface address.
I'm currently using this configuration on every firewall. The "1.2.3.3" is the WAN interface's main address, so it only listens on that IP and nothing else. All the blacklist entries are probably unnecessary like this, but I leave them there just in case. I also noticed that it's better to set allowSecondaryPort and portMappingEnabled to false, because any UPnP can allow ZeroTier to operate on a port other than the default 9993, which may lead to issues again. With this, none of my issues occurred anymore: no packet flooding, no connectivity drops. I'm not sure which of these changes fixed it, though.
```json
{
  "physical": {
    "10.0.0.0/8": { "blacklist": true },
    "172.16.0.0/12": { "blacklist": true },
    "192.168.0.0/16": { "blacklist": true }
  },
  "settings": {
    "portMappingEnabled": false,
    "allowSecondaryPort": false,
    "allowTcpFallbackRelay": false,
    "bind": [ "1.2.3.3" ]
  }
}
```
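Since a typo in local.conf can silently leave ZeroTier binding everywhere, a quick pre-restart sanity check (my own sketch, not part of ZeroTier) is to confirm the file parses as JSON and every `bind` entry is a valid IP address:

```python
import ipaddress
import json

def check_local_conf(text: str) -> list:
    """Parse a local.conf and return its validated bind addresses."""
    conf = json.loads(text)  # raises ValueError on malformed JSON
    binds = conf.get("settings", {}).get("bind", [])
    # ip_address() raises ValueError on anything that isn't a valid IP.
    return [str(ipaddress.ip_address(b)) for b in binds]

# Settings fragment mirroring the config shown above.
conf_text = """
{ "settings": { "portMappingEnabled": false,
                "allowSecondaryPort": false,
                "allowTcpFallbackRelay": false,
                "bind": [ "1.2.3.3" ] } }
"""
print(check_local_conf(conf_text))   # ['1.2.3.3']
```

If this raises, fix the file before doing `service zerotier restart`.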
I'm starting to roll out 1.10.1 again based on this feedback. Mine is a mixed environment with some 1.8.x while I do a slow staggered rollout. I'm using the same binaries I built last time, so literally no change. I'll report back later this week if I see any more stability issues with this rollout.
At this point, every single node in my network is running on 1.10.1 with stability. This includes my OPNsense routers, multiple desktops, and some road warrior laptops. All are behaving as expected. Only exception is my Android phone, because the official app in the Play Store does not have a 1.10.x build yet.
I'm with @dch on this one, I think the issue may have been some wonky way the nodes were interacting with the roots.
My configs are 100% identical as last time, and I'm running the exact same binary I compiled last time too, yet things "just work" this time round.
I think we're good to bump the FreeBSD Ports tree to use 1.10.1 now.
A full week in, and still running perfectly stable on all nodes! I've had no connectivity issues whatsoever.
What config have you done? I'm using a freshly installed & updated OPNsense and can't make my LAN talk to the other LAN over ZeroTier. Thanks, Vitor
I think that's beyond the scope of this particular ticket. If you want to discuss OPNsense + ZeroTier related configs, you can ask in the FreeBSD discord, where I'll be more than happy to answer questions about your particular setup :)
can you share the discord link? thanks
https://discord.gg/freebsd
I'm still seeing these symptoms on a regular basis with 1.10.2 and newer versions.
Peers show direct connection to one-another. Each peer has a direct public WAN IP address.
Pinging between the ZT IPs of nodes sometimes works perfectly for weeks on end, sometimes doesn't work at all, and sometimes flaps on and off every few seconds. It's all over the place.
Usually, restarting ZT on literally every single node on the network will restore connectivity for a long period: sometimes a day, sometimes a week, sometimes a month. But it feels like I'm having to do a full ZT cluster reboot at least once a month.
Thanks for the update. Got a few more questions, just trying to narrow things down.
- In your tweet (xeet?) you mentioned multipath, is that still something you believe to be related? And if so, how?
- Have you tried 1.12.2? Ideally every version should be perfectly backwards compatible with the previous, but in cases where we can't easily replicate what you're seeing, it's a huge time saver for us to know whether the most recent version, which includes our best attempts at fixing known issues, has helped at all.
I'd upgrade two nodes to the most recent version and see how well they behave. If they still show the same issue, let's try to diagnose some of your broken nodes.
If you have a node that is confirmed broken, we'd be interested in seeing the `zerotier-cli dump`. (You can send it to us securely here
The more nodes you can provide dumps for, the better; we may notice a pattern.
Let me know how you'd like to proceed and I'll try to help out.
OPNsense patching auto-updated me to 1.12.2 today, which broke my ZT mesh, forcing me to revert. But even after reverting back to 1.10.2 I was still having some issues. 1.10.2 is "mostly stable", but any time any of my nodes has gone beyond it, stability tanks hard. The 1.8.x and earlier versions were always perfectly stable.
It's difficult to actually upgrade nodes, as ZT is my only means of access to the majority of them; it's how I make connections from my house to the other locations I manage. I guess I can set up an alternative out-of-band network of some kind, but that'll take some work. Maybe I'll deploy a secondary ZT network so I can at least SSH in to fix these primary nodes.
It's just difficult since these nodes handle live traffic, and I've yet to reproduce the issue in a VM. It's only happening on nodes that have direct WAN connectivity, not NAT (but I'm also not running other nodes at this scale yet).
@joseph-henry Okay, this is interesting. On the main links that like to break the most, there are currently TEN active paths between them.
For instance:
A <> B = 10 unique connection paths
A <> C = 10 unique connection paths
Nodes that are always stable only have 2 active connection paths instead.
A <> D = 2 unique connection paths
B <> D = 2 unique connection paths
C <> D = 2 unique connection paths
Is it possible it's creating too many connections, losing track of them, and then sending packets out via connections that are no longer valid? This is why I keep coming back to the thought that multipathing is the issue: everything was always stable in ZT until the exact time multipathing was introduced publicly. I couldn't find any other significant change in the codebase that would explain what I've been seeing for the past year+.
One other note: node "D" is the only one without native IPv6 support, and that's the one that retains stable connections with ALL other nodes no matter what happens with the rest of the network. So it's possible this is somehow tied into IPv6 as well. But I don't see how having dual-stack would create 5x the number of connections per peer.
Hmm. I forget which version it was exactly, but there was a case where, if too many path tuples were available (say, if you had many interfaces each with many assigned addresses), it could cause ZT to start forgetting and re-learning stuff over and over. We added some mitigations for this and no longer see the issue in recent versions. I hope that's what you're seeing.
One way to test this theory without upgrading to 1.12.2 would be to disable some interfaces or reduce the number of assigned addresses per interface. If you can bring the path tuple count down, ZT should smooth out.
I will admit, the `dump` subcommand is one I had not seen before, and it is most certainly helping me see more of what ZT is doing.
I was looking further into the connectivity between two nodes. One side was reporting 14(!) connections, vs only 6 in the other direction.
In terms of multiple IPs: on the only nodes that have multiple IPs on an interface, the configs are set up explicitly to deny-list all but the primary IPv4 and IPv6 address. It created those 14 connections on JUST the two IP addresses.
I'm starting to play with 1.12.1 right now (it's the version available in the OPNsense repo; maybe I'll switch to the FreeBSD repo later for 1.12.2), and things are looking better as long as BOTH sides are on the updated version. It isn't quite perfect though.
With two nodes on 1.12.1, I'm seeing 2 connections in one direction (1 each for v4/v6), but 6 connections in the opposite direction (3 each for v4/v6).
Would it be possible for the path blocks in the dump output to also include the locally bound IP address/port, as well as whether it is a TCP or UDP connection?
> Would it be possible in the dump output for a path blocks to also include the locally bound IP address/port
That might be nice. I'm not sure how to look up anything with the localSocket number either.
dump is "just":

```
zerotier-cli peers -j; zerotier-cli listnetworks -j; zerotier-cli info -j
```

plus, like, ifconfig.
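A rough sketch of stitching those same pieces together yourself; the command list is taken from the comment above, and the injectable `run` parameter exists only to make the sketch testable offline:

```python
import json
import subprocess

def collect_dump(run=None):
    # Gather the sections that reportedly make up `zerotier-cli dump`:
    # peers, listnetworks, and info (as JSON), plus raw ifconfig output.
    if run is None:
        run = lambda cmd: subprocess.run(
            cmd, capture_output=True, text=True, check=True).stdout
    report = {
        section: json.loads(run(["zerotier-cli", section, "-j"]))
        for section in ("peers", "listnetworks", "info")
    }
    report["ifconfig"] = run(["ifconfig"])
    return report
```

On a real node it needs the same privileges (auth token access) as `zerotier-cli` itself; with a fake `run` you can exercise the plumbing without ZeroTier installed.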
I'd suggest upgrading to 1.12.2 and enabling active-backup via multipath.
Just put the following in your local.conf for two nodes as a test:
```json
{ "settings": { "defaultBondingPolicy": "active-backup" } }
```
Everything should be automatic and it will only use the "best" single path. You'll see:
- Faster dead link failover
- Probably a subjective feeling of more link reliability
- Increased ambient traffic (but this can be adjusted)
We do have a way to see the local bound socket's port, but that hasn't been merged yet and will be available only when using multipath, via a command like `zerotier-cli bond <peerid> show`. Likely will be in 1.12.3.