
Netbird Loses Connection to Peers

Open trbutler opened this issue 7 months ago • 3 comments

Describe the problem

A Netbird peer has internet connectivity and can talk to some of its other peers, but over time the number of available peers degrades. This also seems to harm network routes. Restarting the Netbird service temporarily restores full connectivity.

I've created an Ansible playbook to ping the peers on the network. I run the test from a Linux server that is a peer on the Netbird network that has a reliable half-gig connection to the Internet. Several of the peers it tries to reach are expected to be offline at any given time, but there are others that should always be accessible (they are in quality datacenters with redundant connections).
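For reference, the check the playbook performs is roughly equivalent to the sketch below (the inventory file name and the host list are placeholders; my actual playbook just wraps the same idea with reporting):

    # Rough equivalent of the playbook's check (hypothetical inventory name).
    # Ansible's built-in "ping" module verifies SSH-level reachability over the
    # NetBird addresses rather than sending ICMP echoes:
    ansible all -i netbird_peers.ini -m ping

    # A plain ICMP spot-check against the same peer names, run from Independence:
    for host in mesquite.anon-ZDXFz.domain amurmaple.anon-ZDXFz.domain \
                franklin.anon-ZDXFz.domain touchstone.anon-ZDXFz.domain; do
        ping -c 2 -W 2 "$host" >/dev/null && echo "up   $host" || echo "DOWN $host"
    done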

I've tried the setup with both Network Monitoring enabled and disabled. The problem remains the same.

Here's a table of running those Ansible ping tests. For example see how things vary right after a fresh Netbird restart late last night on the peer doing the testing (Independence), a few hours later, this morning and then again after another fresh restart of Netbird moments ago:

Host reachability for the four test runs (After NB Restart, Hours Later, Next Morning, Fresh NB Restart) was indicated with status icons in the original table; the hosts and their locations were:

Host                              Location
amurmaple.anon-ZDXFz.domain       Datacenter #2 (QB)
beatrice.anon-ZDXFz.domain        Studio
bigleafmaple.anon-ZDXFz.domain    Datacenter #2 (QB)
boaz.anon-ZDXFz.domain            Office
cyprus.anon-ZDXFz.domain          Datacenter #1 (CA)
falstaff.anon-ZDXFz.domain        Home
franklin.anon-ZDXFz.domain        Studio
independence.anon-ZDXFz.domain    Studio
juniper.anon-ZDXFz.domain         Datacenter #1 (CA)
madison.anon-ZDXFz.domain         Home
maple.anon-ZDXFz.domain           Datacenter #2 (QB)
mesquite.anon-ZDXFz.domain        Datacenter #1 (CA)
oberon.anon-ZDXFz.domain          Home
rahab.anon-ZDXFz.domain           Office
rosalind.anon-ZDXFz.domain        Studio
spruce.anon-ZDXFz.domain          Datacenter #1 (CA)
sugarmaple.anon-ZDXFz.domain      Datacenter #2 (QB)
thomas.anon-ZDXFz.domain          Office
touchstone.anon-ZDXFz.domain      Studio

Notably, a number of the failed peers, such as Mesquite and Amurmaple, are the ones in data centers, and their public connections to the Internet remain online even as they fail. Others, such as Franklin, are right next to the server doing the testing (Independence) -- those two are on the same switch on the same network. But not all systems on the same network fail (Touchstone is on the same network, for example), nor do all of the datacenter hosts fail (Spruce is actually the bare-metal server on which Mesquite runs as a container). So I don't see an obvious "genre" of hosts that go down versus others that remain up.

Maple and Spruce also rely on an HA network route to reach a service (192.168.5.140); Independence, Franklin, Beatrice, and Touchstone are the routing peers for that route. The route intermittently becomes unavailable, but if I restart netbird on either Maple or Spruce and on the four HA routing peers, the route begins working again.

You'll see Maple is listed as unavailable in the most recent test (coming from Independence), but if I ping it from Oberon, Maple remains available.

Note: All the "office" location systems are offline because of a power outage. Feel free to ignore those, but I wanted to include them in the table just in case such a situation might somehow "ripple" in an unexpected way. When the power is on there, they exhibit a similar situation, where Thomas will become unavailable frequently whereas Boaz remains available most of the time. But, that isn't perfectly consistent: sometimes it is Boaz that goes down and not Thomas.

To Reproduce

Steps to reproduce the behavior:

  1. Ensure full communication with online peers. Run ping test.
  2. Wait several hours.
  3. Run ping test and notice servers at two distinct datacenters are no longer reachable.
  4. Run service netbird restart on the peer doing the test and note that the peers that were offline are now reachable again. (Repeat ad infinitum.)
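For steps 3 and 4, the commands on the testing peer (Independence) look roughly like this -- the specific hosts below are just examples, and the status call is merely a convenient way to see per-peer connection state after the restart:

    # Step 3: re-run the ping test, or spot-check a couple of datacenter peers:
    ping -c 3 mesquite.anon-ZDXFz.domain
    ping -c 3 amurmaple.anon-ZDXFz.domain

    # Step 4: restart the client on the testing peer and check again:
    sudo service netbird restart        # or: sudo systemctl restart netbird
    netbird status --detail             # shows per-peer connection status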

Expected behavior

Peers remain able to talk to each other and to access high availability network routes even if one peer goes offline.

Are you using NetBird Cloud?

Self-hosted 0.44.0

NetBird version

0.44.0 on all but one peer.

Is any other VPN software installed?

No.

Debug output

20250520.log.txt

File key: 1a6ecdff51f59139b215eb9feb49b9dd88a71ef56826d1bdb2744db0448f3680/bf0674ce-c1f1-47ef-a7f1-d4c2900eebe6

Additional context

I'm not sure whether this has any relation to the problems Netbird has resuming after sleep on my network's MacOS clients; see issue #2454.

Have you tried these troubleshooting steps?

  • [ x ] Reviewed client troubleshooting (if applicable)
  • [ x ] Checked for newer NetBird versions
  • [ x ] Searched for similar issues on GitHub (including closed ones)
  • [ x ] Restarted the NetBird client
  • [ x ] Disabled other VPN software
  • [ x ] Checked firewall settings

trbutler — May 20 '25 18:05

Hey all,

In the absence of an official response from the NetBird team, here is my experience:

We were having the same unreliability between peers and have been scanning the repo's issues to see whether other people have been having the same problem or whether there have been any suggestions for fixes/mitigations. I found your issue, and after looking at your debug output I noticed that (at least some of) your peers had quantum resistance enabled. This made me consider whether Rosenpass was causing problems. To test this we disabled quantum resistance and instead set a pre-shared key on all peers; since then we have had perfect uptime. However, YMMV, as I have not monitored pings as closely as you did in your original comment.
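For anyone wanting to try the same change, it was roughly the following on each peer -- flag names are as I recall them from netbird up --help, so double-check them against your client version, and the key file below is a placeholder you'd generate yourself:

    # Generate one WireGuard pre-shared key and distribute the SAME value to all
    # peers (peers with mismatched or missing PSKs cannot talk to each other):
    wg genpsk > netbird.psk

    # On every peer: re-run "up" with quantum resistance (Rosenpass) disabled and
    # the shared pre-shared key set:
    sudo netbird up --enable-rosenpass=false --preshared-key "$(cat netbird.psk)"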

With this, and the fact that our preferred Android client doesn't work with Rosenpass enabled (you can read more on their repo), we have decided to stop using NetBird's implementation of quantum resistance via Rosenpass entirely.

Hope this helps...

Markovich01 — May 28 '25 19:05

@Markovich01 Thank you. I hate turning off quantum resistance, but this does indeed seem to be helping. On another, possibly related report I'd filed, I just wrote about the improved network stability I'm observing after doing as you suggested:

Following up on what @Markovich01 had suggested in #3852. I disabled quantum resistance on every peer on my Netbird network (since the peers couldn't communicate if I only turned it off on some of them), and about 12 hours later, all the peers are still talking to each other. Perhaps more notably, thus far my attempts to wake the Mac have been smooth, without the 100%+ core usage I've been sharing in this bug report. Moreover, Netbird's average CPU usage, which had been constantly hovering between 10-20% on an Apple M3, has dipped to about 0.6% (which is, obviously, way better for battery life on a laptop!).

Whereas previously Netbird would be the most demanding process upon wake and remain in the top five most demanding processes constantly, it seems to be hovering around the 15th most demanding process in Activity Monitor at the moment.

This makes me strongly suspect the bug with waking up a Mac has something to do with Rosenpass. I've been using Rosenpass in Permissive mode, incidentally -- I have not tested the non-permissive mode. What is less clear is whether it has to do directly with Rosenpass on MacOS or with the general network instability that seems to be happening with Rosenpass, since this switch also seems to be improving the general Netbird network instability I reported in that other bug report.
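(For context, "Permissive mode" here just means the clients had been brought up roughly like this -- flag names may vary by client version, so treat this as a sketch rather than my exact invocation:)

    # How the clients had been running (Rosenpass on, permissive mode):
    sudo netbird up --enable-rosenpass --rosenpass-permissive

    # Quick way to check the current quantum-resistance state on a peer
    # (the detailed status output should include a quantum resistance line):
    netbird status --detail | grep -i quantum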

This makes me think that if whatever is causing issues with Quantum Resistance were fixed, it might fix the problems with MacOS sleep-wake as well. Hopefully it will continue this trend -- at least it appears my Netbird network has become more reliably usable than it has been in recent months!

trbutler — Jun 03 '25 18:06

About a month after @Markovich01's suggestion, I've continued with Rosenpass off and Netbird is running incredibly reliably (system resource usage has been dramatically reduced, and the failures on wake from sleep per #2454 have also been resolved). That's wonderful for being able to depend on Netbird's network, but it's unfortunate not being able to use Rosenpass, and it seems to confirm there's a Rosenpass problem at the heart of both this report and #2454.

trbutler — Jul 13 '25 15:07

@trbutler, thanks a lot for the delayed feedback and the summary of what else was fixed by disabling Rosenpass. I'll pass it down to the team.

nazarewk — Jul 14 '25 09:07