Packet flooding and high CPU usage

Open darkain opened this issue 6 years ago • 93 comments

I'm trying to create a basic ZT path between two buildings. Each building has an OPNsense 18.1.9 edge router with the ZT 1.2.8 plugin installed.

Building A: LAN 192.168.2.0/24 - ZT 192.168.5.2
Building B: LAN 192.168.3.0/24 - ZT 192.168.5.3
ZT: 192.168.5.0/24
3 ZT managed routes: one for the ZT network, and one for each of the building LANs, with the ZT IP of the corresponding building's router listed as the gateway.

The two OPNsense nodes are the only nodes in the ZT network. Both have bridging enabled, and auto-assign IP disabled. Flow rules in my.zt are all default. Network is idle other than a Windows box on one building's LAN pinging a Windows box on the other building's LAN (less than 2KiB/sec)

ZT is generating a MASSIVE amount of packets that is spiking the CPU to 100% regularly, yet the packets never go anywhere, and they're not generated from any of the nodes on either network. When this CPU spike happens, all connectivity over ZT is entirely dropped.

Reference: https://drive.google.com/file/d/1NIkdnilV0HSXuytMPn3zHzragyAcEa33/view?usp=sharing

You can see in the screenshot from the OPNsense interface stats that ZT has generated over 600GiB of traffic in total, yet WAN has only transferred around 35GiB and LAN only 21GiB. These stats cover roughly a 24-hour period.

Nothing is matching the ZT network at all in pfTop or Firewall log, so at this point I'm not sure where next to investigate this particular issue?

darkain avatar Jun 05 '18 19:06 darkain

That is indeed extremely strange.

Can you try building one side with ZT_TRACE enabled? make ZT_TRACE=1

adamierymenko avatar Jun 05 '18 19:06 adamierymenko

Also what are these packets? What happens if you tcpdump the zt interface?

adamierymenko avatar Jun 05 '18 19:06 adamierymenko

Well crap, found the issue actually. It is a dup of https://github.com/zerotier/ZeroTierOne/issues/759

So basically, zt-one needs to be aware of managed routes and not attempt to connect to peers over them at all. That would solve this issue. ZT kept bouncing between the remote router's WAN address and its private LAN address (which is no longer reachable once ZT switches to it, because the switch itself breaks the route).

darkain avatar Jun 05 '18 19:06 darkain

Yes, that would be it. I've heard this phenomenon called a "software laser." :)

adamierymenko avatar Jun 05 '18 19:06 adamierymenko
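
The feedback loop described above can be illustrated with a toy state machine. This is only a sketch: it is not ZeroTier's actual path-selection logic, and the rule "switch to the LAN address whenever it looks reachable" is an assumption made purely to reproduce the flapping behaviour reported in this thread. The function name `simulate` is hypothetical.

```python
# Toy model of the "ZT over ZT" feedback loop. NOT ZeroTier's real code:
# the path-preference rule below is an assumption for illustration only.

def simulate(steps):
    active = "WAN"      # address currently used to reach the remote peer
    tunnel_up = True    # whether the ZT tunnel is carrying traffic
    history = []
    for _ in range(steps):
        # The remote LAN address is only reachable through the tunnel itself.
        lan_reachable = tunnel_up
        if active == "WAN" and lan_reachable:
            active = "LAN"      # peer path switches to the routed LAN address
            tunnel_up = False   # that path rides on the tunnel, so it collapses
        elif active == "LAN" and not lan_reachable:
            active = "WAN"      # LAN path is dead, fall back to the WAN address
            tunnel_up = True    # tunnel re-establishes over the internet
        history.append((active, tunnel_up))
    return history


if __name__ == "__main__":
    for step, (path, up) in enumerate(simulate(6), start=1):
        print(f"step {step}: active path={path}, tunnel up={up}")
```

In this model the link never settles: every step undoes the previous one, which matches the reported pattern of a connection that alternates between working and broken.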

Re-opening, because it is still bugged.

More specific details in my particular case.

I have two OPNsense nodes, both with ZeroTier. They each have static routes pointing to each other's LANs so two different buildings can fully cross-communicate. Static routes have been tried manually, as "managed routes" in the my.zerotier interface, and through OSPF (none of these three make a difference, bug exists regardless of how they're set)

ZeroTier attempts all available IP addresses to find ZeroTier peers. The problem is that this ALSO includes the ZeroTier private IP addresses and LAN addresses as well. ZeroTier is attempting to communicate with the remote ZeroTier instance over ZeroTier itself because it sees access to the remote node's LAN address through the static routes. As soon as this connection is made, the WAN IP address connection is disabled. Because of this, the remote LAN address is no longer available, and the ZeroTier connection is broken. At this stage, the static route is also unreachable, so ZeroTier reverts back to the proper WAN address, and re-establishes the connection. This flapping back and forth between WAN and LAN addresses is creating an entirely unstable connection while also packet flooding. In the past 12 hours, this has consumed 1TB of bandwidth just attempting to re-establish connections. If I was not already on an unmetered internet connection, this could be literally costing me hundreds of dollars a day in bandwidth.

Example:

ZT-A > WAN > ZT-B (working)
ZT-A > ZT > ZT-B > LAN (seeing the LAN address as available)
ZT link switches from WAN address to LAN address
ZT link breaks
ZT re-establishes on WAN address

This process repeats over and over again generating a massive amount of packets flooding the system and chewing away at CPU cycles in the process as well.

Up until yesterday, "drop dport 9993;" worked by setting it in the my.zerotier interface. This prevented the ZT communication packets from transferring over the ZT interface, stabilizing the connection. No idea what changed, but this no longer functions. Prior to this, I was using a local.conf file on every single node specifying which addresses it was not allowed to connect to, but this defeats half the point of ZeroTier being a centralized management interface. This also becomes a huge pain as new routers/buildings are onboarded, every single other router in the network needs to have its configuration updated to be made aware to not allow LAN addresses from the new router. We switched from IPsec to ZeroTier+OSPF specifically for centralized and automated configuration, just to be put back where we were in the first place.

Config for individual node (note: each time a new building is added, it must be added to ALL other routers):

{"physical": {"192.168.1.0/24":{"blacklist":true}}}

darkain avatar Dec 21 '18 19:12 darkain
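
Since the per-node blacklist has to list every remote LAN, one way to reduce the manual work is to generate the file from a single list of subnets. The following is a minimal sketch, assuming only the `physical` / `blacklist` keys shown in the comment above; the function name `build_local_conf` and the example subnet list are hypothetical, and distributing the result to each router is left out.

```python
import ipaddress
import json


def build_local_conf(lan_subnets):
    """Build a local.conf-style dict that blacklists each given subnet.

    Uses only the "physical" -> {subnet: {"blacklist": true}} layout
    quoted in this thread.
    """
    physical = {}
    for subnet in lan_subnets:
        # Validates the CIDR string; raises ValueError on malformed input.
        net = ipaddress.ip_network(subnet)
        physical[str(net)] = {"blacklist": True}
    return {"physical": physical}


if __name__ == "__main__":
    # Hypothetical list: the two building LANs from the original report.
    subnets = ["192.168.2.0/24", "192.168.3.0/24"]
    print(json.dumps(build_local_conf(subnets), indent=2))
```

This does not remove the need to update every node when a building is added; it only keeps the subnet list in one place.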

https://github.com/zerotier/ZeroTierOne/blob/52c4385c16ec1e989369d550e0f720a050f72e32/service/OneService.cpp#L2398

Do we need one of these sections for BSDs?

laduke avatar Dec 21 '18 20:12 laduke

The issue at hand is not about binding to a particular interface. In this case, it is binding to WAN and LAN interfaces. The issue is that as soon as LAN subnets are bridged between two different locations (via Managed Routes or otherwise), the two ZT nodes will attempt to communicate with each other via the LAN addresses instead of the WAN addresses. The LAN addresses should still be bound for local nodes.

Instead, I think ZT traffic should be flagged and filtered out from being allowed to be passed over a ZT tunnel. Is there ever a case when a ZT network should be encapsulated inside of another ZT network?

darkain avatar Dec 21 '18 21:12 darkain

Is there ever a case when a ZT network should be encapsulated inside of another ZT network?

Yes there is. For instance, Google Kubernetes Engine only has link-local IPv6 addresses on its Kubernetes nodes, so we use a ZeroTier network to pipe in a routable /64 to Kubernetes. This is controller traffic, but it's still ZeroTier packets encapsulated in a ZeroTier network.

glimberg avatar Dec 22 '18 02:12 glimberg

I seem to have this same problem. Instead of ZT traffic going over itself using the IPv4 address of the remote LAN, my problem seems to be caused by an IPv6 address propagated over the ZT link to the remote site.

After setting up ZT between two LANs everything usually works fine for some time, but eventually it ends up in the same state described here earlier: the address of the peer fluctuates between the proper public IPv4 address and a private IPv6 address that was propagated to the other side of the ZT link via IPv6 RA. When the peer listing shows this private IPv6 address as the peer's active address, CPU usage hits 100%, the connection breaks and a huge volume of traffic is generated. Strangely, the generated traffic has the same IPv6 address as both source and destination (the address of the remote peer).

The problem in my case is that even blacklisting the IPv6 network in local.conf doesn't solve it. :/

Any ideas that might help here?

chacal avatar Dec 29 '18 17:12 chacal

It seems that I was able to fix my problem by adding the bridge interface on the remote LAN end to interfacePrefixBlacklist on local.conf. Now the IPv6 address still propagates there properly, but it doesn't seem to be used for ZT traffic anymore and thus sending ZT traffic "over itself" is avoided.

chacal avatar Jan 04 '19 17:01 chacal

I've switched to trying the same for now to see how it goes. I have the following local.conf that I'm starting to test as of today:

{
  "settings": {
    "interfacePrefixBlacklist": ["zte"],
    "allowTcpFallbackRelay": false
  }
}

Right now I'm trying to create a standardized configuration for easier deployment in multiple data centers. I plan on doing a full write up of basically an autonomous multi-network routing system using ZeroTier, essentially a private virtualized internet on top of the internet itself. Hopefully with this simple config, I can now have ZT entirely stable and focus on the other services on top of it!

darkain avatar Jan 29 '19 21:01 darkain

I'm also having this issue over a bridged setup, and adding "drop dport 9993;" to my flow rules also helped for a few days but no longer works. I'm planning to try the above blacklisting method. Can anyone advise as to where my local.conf file would live on Raspbian/Debian, or where I should create it? I'm pretty new to Linux, and Googling interestingly hasn't helped answer this seemingly straightforward question. Thanks!

cferrey avatar Feb 05 '19 12:02 cferrey

Here's some information about the local.conf file: https://github.com/zerotier/ZeroTierOne/tree/master/service

On Debian it should be placed at /var/lib/zerotier-one/local.conf (assuming you have installed ZeroTier from the prebuilt .deb package).

chacal avatar Feb 05 '19 12:02 chacal

Here's some information about the local.conf file: https://github.com/zerotier/ZeroTierOne/tree/master/service

On Debian it should be placed at /var/lib/zerotier-one/local.conf (assuming you have installed ZeroTier from the prebuilt .deb package).

Thank you very much -- I did not have that file, but created it with sudo nano and added this single line:

{ "settings": { "interfacePrefixBlacklist": ["br0"], "allowTcpFallbackRelay": false } }

Unfortunately, this didn't work. I also tried adding my ZT interface instead of the br0 interface, but no luck. Do you have any thoughts on what I'm doing wrong? My br0 interface bridges the ZT and eth0 interfaces, and br0 receives a static IP while eth0 has no IP assigned.

Edit: I also added a physical route blacklist for the common subnet being used on my ZT network and at both remote LANs in my L2 bridged setup. This also did not work. My full local.conf file is below -- hoping someone can point out any issues.

{ "physical": { "10.0.0.0/16": { "blacklist": true } }, "settings": { "interfacePrefixBlacklist": [“br0"], "allowTcpFallbackRelay": false } }

cferrey avatar Feb 05 '19 13:02 cferrey
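
When a hand-edited local.conf appears to have no effect, one inexpensive first check is whether the file parses as JSON at all. For example, a typographic quote such as “ in place of a straight " makes the file invalid JSON. The sketch below only checks syntax; it assumes nothing about which keys ZeroTier accepts, and the helper name `check_local_conf` is hypothetical.

```python
import json


def check_local_conf(text):
    """Return (ok, message) describing whether text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at the first character the parser rejected.
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, "valid JSON"


if __name__ == "__main__":
    import sys

    # Usage: python check_conf.py /path/to/local.conf
    with open(sys.argv[1], encoding="utf-8") as handle:
        ok, message = check_local_conf(handle.read())
    print(message)
    sys.exit(0 if ok else 1)
```

A file that passes this check can still be ignored for other reasons (wrong location, wrong key names, service not restarted), so this only rules out one class of mistake.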

Don't know about your specific setup, but for me blacklisting using IP address helped. My local.conf (with IP address obfuscated):

{
  "physical": {
    "2001:2003:xxxx:xxxx::/56": {
      "blacklist": true
    }
  }
}

The mentioned IPv6 network is the one that is propagated to the remote site using IPv6 router advertisements.

chacal avatar Feb 05 '19 14:02 chacal

As an update, the interface blacklist didn't work. Also, I now know why the flow rules for 9993 don't work, but that'll be a separate issue.

darkain avatar Feb 05 '19 17:02 darkain

Don't know about your specific setup, but for me blacklisting using IP address helped. My local.conf (with IP address obfuscated):

{
  "physical": {
    "2001:2003:xxxx:xxxx::/56": {
      "blacklist": true
    }
  }
}

The mentioned IPv6 network is the one that is propagated to the remote site using IPv6 router advertisements.

Unfortunately I've hit a dead end here. I have no IPv6 addresses in my setup, as I am assigning IPv4 addresses to the bridging devices manually through ZT Central. I don't see any IPv6 addresses when I do listpeers on the bridging devices, so I'm at a loss as to what else to try.

I'll keep an eye on #915. Hope that can be resolved and that it'll fix all these issues, as setting a single flow rule seems much more scalable than editing configs on all ZT clients.

cferrey avatar Feb 06 '19 02:02 cferrey

This (and #759) is still broken, if anybody is interested :( I still have ZT nodes with their INTERNAL ip address in my peer list.

4ccda4xxxx 1.4.6  LEAF      94 DIRECT 0        16406    10.4.0.2/9993

This IP can only be reached over the ZT tunnel itself. Zerotier tries to do just that, using one CPU core to 100% and sending millions of packets that never go anywhere, until something resets and it goes back to normal. The annoying thing is that this causes connections to all other nodes to drop or at least go bad, because the CPU usage causes a general latency spike (up to 1500 ms, then pings time out).

Why does Zerotier not blacklist all ZT interfaces and all internal routes internally as default? Is there any use case for allowing ZT connections over a ZT tunnel?

StrikerTwo avatar Jan 09 '20 11:01 StrikerTwo

And this is what a packet spike looks like:

Ethernet Type IP Protocol Source Address Destination Address Source Port Destination Port Service Name Status Packets Count Total Packets Size Total Data Size Data Speed Maximum Data Speed Average Packet Size Maximum Packet Size First Packet Time Last Packet Time Duration Latency Process ID Process Filename TCP Ack TCP Push TCP Reset TCP Syn TCP Fin Maximum Segment Size TCP Window Size TCP Window Scale TTL Source Country Destination Country
IPv4 UDP 10.0.0.3 10.4.0.2 28053 9993 2.480.905 1.525.624.328 1.456.158.988 10723.3 KiB/Sec 614.9 1460 09.01.2020 11:45:23 09.01.2020 11:50:21 00:04:57.213 1992 zerotier-one_x64.exe 0 0 0 0 0 127

StrikerTwo avatar Jan 09 '20 11:01 StrikerTwo
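
The figures in that capture row can be sanity-checked with a little arithmetic. This sketch assumes that 2.480.905 is the packet count, 1.525.624.328 is the total packet size in bytes, and 00:04:57.213 is the duration; that column reading is an assumption, although the computed average packet size agrees with the 614.9 value in the same row.

```python
# Values taken from the capture row above (dots are thousands separators).
# Units are assumed: sizes in bytes, duration as hh:mm:ss.mmm.
packets = 2_480_905
total_bytes = 1_525_624_328
duration_s = 4 * 60 + 57.213

avg_packet_bytes = total_bytes / packets            # mean bytes per packet
packets_per_second = packets / duration_s           # mean packet rate
kib_per_second = total_bytes / duration_s / 1024    # mean throughput

print(f"average packet size: {avg_packet_bytes:.1f} bytes")
print(f"packet rate:         {packets_per_second:.0f} packets/s")
print(f"throughput:          {kib_per_second:.0f} KiB/s")
```

Under these assumptions a single flow to 10.4.0.2 on port 9993 averages several thousand packets per second for about five minutes, which is consistent with the CPU and latency symptoms described.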

@StrikerTwo are you on a BSD?

laduke avatar Jan 09 '20 19:01 laduke

Nope, Windows Server on both sides (2012 R2 / 2016)

StrikerTwo avatar Jan 10 '20 08:01 StrikerTwo

Heh looks like it doesn't avoid binding 'zt' interfaces on windows either, but I dunno

laduke avatar Jan 10 '20 18:01 laduke

FWIW this issue is affecting me as well. I have a very simple zt network, defined with all defaults. Nothing was customized. I have one Windows 10 PC on the LAN running zt, and one PC with same on another LAN. I use zt for remote access using RDP. Every few days or so the whole LAN grinds to a halt for about a minute and then mysteriously clears up. I traced one such incident using wireshark and there's millions of packets flowing over the LAN heading towards zt nodes. I uninstalled zt and the problem went away. This is a shame. It is such a great product, but this is a fatal flaw.

rexxfan avatar Jan 22 '20 13:01 rexxfan

Are you sure this is not valid traffic generated by the windows 10 pc's? E.g. windows update traffic between them? See https://www.digitalcitizen.life/how-set-windows-10-get-updates-local-network-internet

janjaapbos avatar Jan 22 '20 13:01 janjaapbos

Yes, I am sure. I do not have that feature of Windows Update configured. I get all my updates by downloading them from Microsoft. Neither the "obtain updates from other PCs on the LAN" nor the "obtain updates from other PCs on the internet" option is turned on.

rexxfan avatar Jan 22 '20 14:01 rexxfan

Me neither. And 9993 is Zerotier's port, nothing to do with Windows Update.

StrikerTwo avatar Jan 22 '20 14:01 StrikerTwo

So, as it turns out, somehow the setting to obtain updates from other PCs on my LAN had been turned back on on all my machines. I could have sworn they were all set to off. Maybe the 19H2 update did it, I don't know. In any event I have reconstituted my ZeroTier network and am attempting to recreate the issue. Here's hoping I can't. We shall see. My apologies for relying on my memory of those settings rather than actually checking them.

rexxfan avatar Jan 25 '20 00:01 rexxfan

Check zerotier-cli peers for zerotier ip addresses in the path column, like StrikerTwo had.

In either case, maybe we should drop port 7680 in the default rules, or at least have a note about it.

laduke avatar Jan 25 '20 00:01 laduke
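
The check suggested above (looking for ZeroTier-side addresses in the path column of the peer list) can be scripted. This is a rough sketch that works on pasted peer-list text; it assumes the layout seen in this thread, where the node ID is the first field and the path is the last field in `ip/port` form. The function name `suspicious_peers` and the example subnet 10.4.0.0/16 are hypothetical; substitute the subnets that are only reachable through your own ZT network.

```python
import ipaddress


def suspicious_peers(peers_output, tunnel_subnets):
    """Return (node_id, path) pairs whose path IP lies in a tunnel-only subnet."""
    nets = [ipaddress.ip_network(s) for s in tunnel_subnets]
    flagged = []
    for line in peers_output.splitlines():
        fields = line.split()
        if not fields or "/" not in fields[-1]:
            continue  # blank line, header, or a peer with no ip/port path
        ip_text = fields[-1].rsplit("/", 1)[0]
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            continue  # last field was not an IP address
        if any(ip in net for net in nets):
            flagged.append((fields[0], fields[-1]))
    return flagged


if __name__ == "__main__":
    sample = (
        "4ccda4xxxx 1.4.6  LEAF      94 DIRECT 0        16406    10.4.0.2/9993\n"
        "34e0a5e174 - PLANET -1 DIRECT 2132 1778 147.75.92.2/9993\n"
    )
    # 10.4.0.0/16 is an assumed tunnel-only subnet for this example.
    for node_id, path in suspicious_peers(sample, ["10.4.0.0/16"]):
        print(f"{node_id} is using a tunnel-side path: {path}")
```

Because the bad path only appears while the problem is occurring, a check like this is most useful when run repeatedly, or at the moment the network misbehaves.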

Ok, I checked into that. Here is what I found:

C:\ProgramData\ZeroTier\One>zerotier-one_x64.exe -q peers
200 peers <lastTX> <lastRX>
34e0a5e174 - PLANET -1 DIRECT 2132 1778 147.75.92.2/9993
3a46f1bf30 - PLANET 155 DIRECT 2132 1968 185.180.13.82/9993
763bc49b78 1.4.6 LEAF 116 DIRECT 28 136 73.5.204.223/31602
778cde7190 - PLANET 87 DIRECT 12822 2040 103.195.103.66/9993
992fcf1db7 - PLANET 171 DIRECT 2134 1948 195.181.173.159/9993
a0cbf4b62a 1.4.1 LEAF 506 DIRECT 10244 10244 34.94.185.87/21017

rexxfan avatar Jan 26 '20 19:01 rexxfan

You need to check that while your network behaves strangely. Not when everything is okay.

StrikerTwo avatar Jan 26 '20 19:01 StrikerTwo