Peer link is being dropped
Describe the problem
I have netbird installed as an overlay network. I have an ingress server and another server in another location. Neither are behind NAT. As far as I can tell everything is working properly, and generally does. The overlay peer network got dropped and the ingress server stopped talking to the other server. I noticed it quickly because of my host down alerts. I had to restart netbird on the ingress server to get the peer connection back up.
I've been running netbird for more than a year.
To Reproduce
Steps to reproduce the behavior:
1: Run netbird for a long period of time
2: Monitor connections for loss
3: Run upgrades and configuration changes
I suspect that this happens around an upgrade or configuration change somewhere in the overall system. I am not certain as this only happens rarely. I suspect that there are bugs, probably race conditions, in the teardown and setup procedures that create this condition.
Expected behavior
Peer connections are not dropped.
Are you using NetBird Cloud?
Self-hosted
NetBird version
Ingress was 0.29.2 on this latest time (has been observed with numerous versions), and the other server was on 0.35.0.
Additional context
I don't have time to track this down and be more specific as it's not consistent. This last incident was a production outage that I never had with Nebula, so I'm switching back. The system looks nice and I've seen a lot of improvements, but I need reliability above all else and I haven't found it here. Good luck.
Hey, just to chime in: based on my testing, on Linux peers running versions 0.34.0 or above, the connection drops after about 5 minutes without recovering. In my case this is the setup:
Management Server: netbird-mgmt version 0.35.1 (Docker)
Peer Server 1: 0.34.0 - 0.35.1
Peer Server 2: 0.34.0 - 0.35.1
Peer Server 3: 0.34.0 - 0.35.1
All servers run on static IPs, and all three peers were running the same version of the client. Peer Server 1 would drop the connection to the other peer servers roughly 5 minutes after starting NetBird.
I have already attempted adding a common allow-all UDP ports rule, but to no avail. So essentially, even if we assume the management and peer servers are all running the same 0.35.1, one server will always fail after some time, specifically Peer Server 1. Just to clarify, I've been running NetBird since version 0.27.0, and everything was working fine up until recently.
I can get the logs; however, as this is a live environment, it will generate downtime, so I will have to wait until a maintenance window, which will be some time in early January. In the meantime I will try to get at least the logs for Peer Server 1 so there is some data.
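In case it is useful to anyone else trying to collect data without a full maintenance window: recent clients ship a debug command that can gather a bundle in place. This is a sketch from memory, so please verify against netbird debug --help on your version before using it in production:
netbird debug bundle   # packages status, config and recent logs into an archive for support
netbird debug for 5m   # raises logging to trace for ~5 minutes, then produces a bundle
I am not certain whether the second form briefly resets connections on every version, so it may be safer to try it on a non-critical peer first.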
Interesting timing. I've been seeing more link instability in the last few days, since 0.35 maybe, requiring a restart of some peers to reconnect. Sometimes they think they are connected but are not passing traffic.
I did have some stability issues back around pre-0.20 or so that required restarting clients. Since then things had been quite stable for many months.
I know this is very vague and doesn't provide useful information in and of itself, but I just wanted to add my anecdotal experience that this kind of instability hadn't shown up in my environment for quite some time.
Logs from a peer at the time it dropped off:
2024-12-29T11:34:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:39:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:44:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:44:43+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:44+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:47+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:49+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:54+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:45:06+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:49:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:54:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
Interesting timing. I've been seeing more link instability in the last few days, since 0.35 maybe, requiring a restart of some peers to reconnect. Sometimes they think they are connected but are not passing traffic.
Hey @hadleyrich, I agree, that's pretty much what I've been battling with for the past couple of weeks. I have a load balancer which uses the VPN to connect to various other VPS peers so that we can have a simple HTTP reverse proxy on port :80. As of 0.34.0, the load balancer drops the connection to the other VPS peers without retrying, requiring a manual restart of the NetBird client.
I did have some stability issues back around pre-0.20 or so that required restarting clients. Since then things had been quite stable for many months.
That's pretty much my experience. I joined at around version 0.27.0, I think, fully converted from a traditional VPN by around 0.28.0, and things were relatively stable, so I stayed. That said, I think they need separate nightly and stable release channels at this point; I agree with @bmansfie that, running this in production, I need stability before anything else. Yesterday I had a 2-hour downtime because the 0.29.4 client did something while I was applying an access policy and took down all external ports, which absolutely wrecked my DNS server and all DNS records for a good 4 hours; thankfully, nowadays, it only takes around 2 hours to re-propagate. That said, I'd like for that not to happen again...
I know this is very vague and doesn't provide helpful information in and of itself, but I just wanted to add my anecdotal experience that this kind of instability hadn't shown up in my environment for quite some time.
I wouldn't really call it "anecdotal". I have a monthly maintenance window during which I upgrade all of the packages on the OS, so when I do eventually upgrade, I may jump many minor and patch releases. Because things were more or less stable, I had no issues upgrading to the latest. Right now, all of my servers are sitting on a downgraded 0.33.0, as it seems to be the last stable release of the recent ones; before that it was 0.29.4. That said, after yesterday, I am fearful of all versions 😅.
I just noticed, on a peer that had lost communication with another peer, that "Last WireGuard handshake" was hours old while "Last connection update" was only minutes old, so it certainly points to something at the WireGuard level becoming out of sync.
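A quick way to cross-check what the kernel itself reports against NetBird's view (a sketch; it assumes a Linux peer using kernel WireGuard, wireguard-tools installed, and NetBird's default wt0 interface name):
sudo wg show wt0 latest-handshakes   # one line per peer: public key and Unix timestamp of the last handshake (0 = never)
netbird status --detail              # compare against the Last WireGuard handshake / Last connection update fields
If the client has fallen back to userspace WireGuard, wg show may not list the interface at all.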
I think you're probably right; I likely saw stability issues reappear around 0.34. I had become quite (probably overly) comfortable with the level of stability over the past months and have been happily tracking the latest releases. I don't yet run netbird in a real production setting; it's more of a long-term stability test on my homelab "production" services before deploying to customer-facing workloads.
Hmm, it's funny: I have one machine that drops, and it's a Windows Server 2022; I don't see other clients stop. A simple restart fixes it, but I have to do it every day. I have Linux clients and an older Windows SBS server that all seem to be OK. I also have Windows 11 clients that again seem fine; even my Arch desktop is fine. Very odd.
Another data point: a long-running ping in a screen session to keep traffic going over the link appears to keep the peer connected.
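For anyone who wants to try the same band-aid, this is roughly what I mean (100.92.0.2 is only a placeholder for the remote peer's NetBird IP):
screen -dmS nb-keepalive ping -i 10 100.92.0.2   # detached screen session pinging the peer every 10 seconds
screen -r nb-keepalive                           # re-attach later to check on it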
It seems like the issue is with the WireGuard handshake. For instance, my Windows 11 PC seemingly struggles to connect to other Linux server peers despite everything running the latest NetBird version, in this case 0.35.2. One of my load balancer servers running Ubuntu 22.04 just refuses to keep the connection to other Linux servers for longer than 5 minutes before dying and needing to be restarted. I don't know what I'm doing wrong, but I always update the management server first and only then move on to the client nodes, first the Linux servers and then devices such as PCs/laptops/phones.
Hi Everyone, happy New Year!
Hey @mlsmaycon, sorry to ping you directly. Would you like me to run the same steps as listed last time? I will email the logs so you have a better picture. I am approaching a maintenance window for all our org servers and will be able to run a full debugging trace like last time. Also, I need to know if the logging persists across client updates or whether I need to run it first on the old version and then after the upgrade.
I think (in my case at least) this appears to be something triggered by, or related to, relaying.
Previously I was not running the relay in my setup, only coturn. The peer I was having the most trouble with was connecting over relay.
Adding in the new relay service appears to have made that peer more stable for the last 12 hours or so.
Previously I was not running the relay in my setup, only coturn. The peer I was having the most trouble with was connecting over relay.
Hmm, interesting. In my case, I am already running the new relay service. Strangely, some client versions seem to overuse the relay and some underuse it; since 0.35.0, the client seems to bypass it altogether and go straight for P2P.
Okay, you know what? It's late at night here in the UK, so let me try upgrading and getting at least some logs.
Ok, without looking at the trace logs generated by the client, my anecdotal research log shows this:
4:54 UK7 client upgraded from 0.33.0 to 0.35.2
4:58 UK7 sites show as status 503 (down) on UK1, which is still using the 0.33.0 client
5:00 UK7 client downgraded back to 0.33.0; I am waiting for all sites to recover, which takes around 5 seconds
5:04 UK1 client upgraded from 0.33.0 to 0.35.2
5:08 UK3 websites are shown as down despite only the UK1 client being up to date
.. some time here I downgraded the UK1 client to 0.33.0
5:19 UK1 client again upgraded from 0.33.0 to 0.35.2
5:23 UK2, UK3 and UK5 sites went down; they use the 0.33.0 client. UK7, however, is still up
5:25 UK1 client restarted; sites are going back up
5:37 UK1 client downgraded back to 0.33.0; things are back to normal
@mlsmaycon I've collected a full trace from UK1 and UK7 using the method listed in https://github.com/netbirdio/netbird/issues/3112#issuecomment-2562361089 and am now parsing it to see if there is anything obvious. I will send it to the support email once I've reviewed everything.
We're encountering the same issue with our Netbird instance. Version 0.33 was incredibly stable for over 90 days, maintaining continuous communication with our peer. However, after upgrading directly to version 0.35, we observed the same problem that @hadleyrich mentioned starting from version 0.34. Although the peer is online and connected to our server, there is no communication. It seems the bug might have been introduced in that version. We've debugged the issue and found a temporary workaround: disabling and re-enabling the policy in access control, which restores communication. I'm happy to provide more details to help resolve this.
We have the same issue with some peers. @fiikra, I have tested your workaround, but it is not working for me. When I run netbird status --detail and check the peer I have no connection to, I see it is connected but has no Last WireGuard handshake.
Status: Connected
-- detail --
Connection type: Relayed
ICE candidate (Local/Remote): -/-
ICE candidate endpoints (Local/Remote): -/-
Relay server address: rel://vpn.example.com:33080 <-- here stands my real domain, obviously
Last connection update: 25 seconds ago
Last WireGuard handshake: -
Transfer status (received/sent) 0 B/740 B
Quantum resistance: false
Routes: -
Networks: -
Latency: 0s
and I have other peers that just work fine
Status: Connected
-- detail --
Connection type: Relayed
ICE candidate (Local/Remote): -/-
ICE candidate endpoints (Local/Remote): -/-
Relay server address: rel://vpn.example.com:33080 <-- here stands my real domain, obviously
Last connection update: 9 minutes, 33 seconds ago
Last WireGuard handshake: 1 minute, 27 seconds ago
Transfer status (received/sent) 22.0 KiB/16.7 KiB
Quantum resistance: false
Routes: -
Networks: -
Latency: 0s
It worked before, and all the peers mentioned are using version 0.35.2. When I reinstall my client with version 0.27.3, it works. My own peer is installed on Windows; the two peers from the example are on Linux.
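Since the broken peers above all report Connected but show "Last WireGuard handshake: -", a crude way to spot them across the whole peer list is something like this (a sketch based on the field labels shown above; adjust the amount of context so the peer name is visible, and the labels if your version prints them differently):
netbird status --detail | grep -B 15 'Last WireGuard handshake: -'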
Hey @mlsmaycon, I am beginning to get really frustrated by this. We are getting new releases that introduce more features, but none address the issue of the peer link being dropped. I've sent an email to support with the attached debug trace logs for two servers that keep dropping links after upgrading. Has anyone looked at it? I really don't want to be that kind of person, but at this point I'm frustrated enough that I am this 🤏🏻 close to trying, and perhaps even switching to, Headscale.
Can you guys check if you have "redundant" ACLs like:
- ACL1: Allow all Peers to ICMP > Peer01
- ACL2: Allow the monitoring Peers to ICMP > Peer01
Toggling one of the ACLs off and back on brings connectivity back for me.
I just spoke with someone from the Headscale community who has used NetBird before. They suggested disabling the rosenpass and rosenpass-permissive modes on the affected clients. After doing that and upgrading all clients, the issue appears to have disappeared, though I will keep monitoring it. I assume the problem is somewhere in the rotation of the Rosenpass keys, as the peers drop the connection almost exactly 5 minutes after establishing a link.
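For anyone who wants to try the same workaround from the CLI: I believe the relevant switches live on netbird up, but I am going from memory, so check netbird up --help on your version first.
netbird down
netbird up --enable-rosenpass=false --rosenpass-permissive=false   # re-register with Rosenpass disabled
If the =false form is not accepted on your version, the equivalent settings should also be editable in the client's config.json, followed by a service restart.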
@rihards-simanovics did you have peers with different versions of NetBird and rosenpass enabled? After the upgrade, did you enable rosenpass again?
Hi @mlsmaycon, thanks for replying. No, the versions were precisely the same across all peers when the issue occurred. To better illustrate the environment, all 9 peers (Ubuntu 22.04/24.04 servers):
- were upgraded to either 0.36.2 or 0.36.3,
- were upgraded directly from 0.33.0 within roughly 2 minutes of one another,
- had the rosenpass and rosenpass-permissive flags set to true,
- were running Ubuntu 24.04, except the load balancer, which ran Ubuntu 22.04,
- have static IPv4 and IPv6 addresses.
When the peer dropped the connection, which happened roughly every 5 minutes, I restarted all of the servers so that if there was anything strange with the OS, it would have been accounted for. However, after around 10 minutes of things going down, I had to revert back to 0.33.0.
By the way, here is a quick update on stability after disabling the rosenpass and rosenpass-permissive modes: all 9 peers have been running 0.36.3 since I posted this comment, and nothing has dropped its connection yet.
Piping in to state that my org is also having this same issue. We do not have the rosenpass options enabled. All Windows clients, all on versions >0.34.1. Only a subset of users are having issues, and only users who have authentication expiration enabled; clients with expiration disabled have not had any issues. The server version is 0.35.2; we have upgraded several times trying to resolve this issue. Typically the affected clients fail when coming back from idle, but it can also happen on a fresh connection. The WireGuard handshake never completes. If I change the peer group, the handshake issue immediately resolves.
I am also affected by this issue (as far as I can tell), with netbird status -d sporadically showing no recent WireGuard handshake. The issue is with one specific SSH-enabled server only; the other peers in the network always seem to have recent handshakes listed. The workaround of cycling the access control policy seems to work.
My computer is on the bottom, the problematic peer up top. I have tried various versions though, and a colleague on 35.2 can connect without issue.
Hi, same issue here, with no rosenpass or rosenpass-permissive enabled. My practical fix is to disable and re-enable the policy on the problem group, but we need a real solution; we can't do that every day.
I think I have the same issue. In my case, Peer 1 is the node from which I access the resources of Peer 2. However, if I reboot Peer 2 for maintenance, Peer 1 can no longer access the subnets on Peer 2 unless I also reboot Peer 1. Only after restarting Peer 1 do the resources on Peer 2 become accessible again.
Peer 1 can no longer access the subnets on Peer 2—unless I also reboot Peer 1. Only after restarting Peer 1 do the resources on Peer 2 become accessible again.
I have exactly the same problem. Thanks for the tip about Peer 1; I have just restarted it, and now access to the resource is working again, for the time being.
@pscriptos @SuperKali Can you update to 0.46.0, watch out for the issue/further new versions and report back with results after some time?
We have identified some form of race condition that was partially fixed in https://github.com/netbirdio/netbird/pull/3910 and is still being worked on in https://github.com/netbirdio/netbird/pull/3929
Hi @nazarewk,
thanks for mentioning me. I usually keep all peers updated to the latest version to check if the issue has already been resolved. Unfortunately, recently—and randomly—my Uptime Kuma reported a loss of connectivity to some VPCs from one of the peers, and I had to restart the current peer using netbird service restart.
Let me know if you need any additional information.
Thanks again!
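Until the race in the PRs above is fully fixed, a crude watchdog can at least automate the restarts some of us keep doing by hand (a sketch only; 100.92.0.2 is a placeholder for a peer that should always be reachable, and restarting the client is a band-aid, not a fix):
#!/bin/sh
# Run from cron every few minutes: if the reference peer stops answering pings, restart the NetBird client.
PEER="100.92.0.2"
if ! ping -c 3 -W 2 "$PEER" > /dev/null 2>&1; then
    logger "netbird-watchdog: $PEER unreachable, restarting netbird"
    netbird service restart
fi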
@pscriptos Can you update to 0.46.0, watch out for the issue and further new versions, and report back with results after some time?
All affected peers were already updated to version 0.46.0 a few days ago. When I got stuck with my problem, I looked at this GitHub repo and saw that, at that time, the 0.46.0 update had only been out for 3 hours. I also wrote about it here: #3699