Netbird relay connection stale for some peers (workaround found)
Hello
With a self-hosted NetBird 0.45.1 server and peers on 0.45.3 and 0.36.5, connections that are relayed because of CGNAT (one peer is a 5G router, the other is a Windows PC behind a corporate firewall) go "stale" after a while: the peers can no longer ping each other, yet the status still says connected:
$ netbird status -d
pictet-nvr1.netbird.stvs:
NetBird IP: 100.70.94.175
Public key: wNWlJ95DqnJMCdXX77gZwVLB4oDDInwp7DpACxy/SV4=
Status: Connected
-- detail --
Connection type: Relayed
ICE candidate (Local/Remote): -/-
ICE candidate endpoints (Local/Remote): -/-
Relay server address: rels://netbird.stvs.com:443
Last connection update: 7 hours, 9 minutes ago
Last WireGuard handshake: 7 hours, 10 minutes ago
Transfer status (received/sent) 711.3 MiB/18.1 GiB
Quantum resistance: false
Routes: -
Networks: -
Latency: 52.905573ms
$ wg show
peer: wNWlJ95DqnJMCdXX77gZwVLB4oDDInwp7DpACxy/SV4=
endpoint: 127.0.0.1:38500
allowed ips: 100.70.94.175/32
latest handshake: 7 hours, 13 minutes, 32 seconds ago
transfer: 711.28 MiB received, 18.11 GiB sent
persistent keepalive: every 25 seconds
As you can see, the latest handshake is far too old. A simple workaround is to stop and start netbird, but that kills all other connections (the PC is connected to many routers). Another workaround is to remove the problematic router from its policy group and add it again to force an update, but doing that manually is annoying.
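To spot this state automatically, here is a rough Python sketch that scans the text of `netbird status -d` for peers reported as Connected whose last WireGuard handshake is older than a threshold. It assumes the line layout shown in the paste above, which may differ between NetBird versions. The function names and the 300-second default are my own choices, not anything NetBird defines. With persistent keepalive enabled, a healthy tunnel normally re-handshakes every couple of minutes, so a handshake that is hours old is a strong hint.

```python
import re

# Seconds per unit for durations such as "7 hours, 10 minutes ago".
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def parse_age(text):
    """Convert '7 hours, 10 minutes ago' to seconds; None if no duration is found."""
    parts = re.findall(r"(\d+)\s+(second|minute|hour|day)s?", text)
    if not parts:
        return None
    return sum(int(count) * _UNIT_SECONDS[unit] for count, unit in parts)


def stale_connected_peers(status_text, max_age=300):
    """Return {peer_name: age_seconds} for Connected peers with an old handshake.

    Assumes each peer block starts with a line like 'name.domain:' and
    contains 'Status:' and 'Last WireGuard handshake:' lines, as in the
    paste above. max_age is a hypothetical threshold, tune it as needed.
    """
    stale = {}
    peer = None
    connected = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.endswith(":") and " " not in line:
            # A new peer block begins, reset the per-peer state.
            peer, connected = line[:-1], False
        elif line.startswith("Status:"):
            connected = line.split(":", 1)[1].strip() == "Connected"
        elif line.startswith("Last WireGuard handshake:") and peer:
            age = parse_age(line.split(":", 1)[1])
            if connected and age is not None and age > max_age:
                stale[peer] = age
    return stale
```

Feeding it the captured output of `netbird status -d` from a cron job or scheduled task would at least give a list of the peers that need attention.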
I guess one could also use wg set to remove the offending peer, and netbird would then recreate the WireGuard peer? So maybe I can monitor the latest handshakes and "kill" the peers that are stuck?
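The monitoring half of that idea could look roughly like the Python sketch below. It reads `wg show <interface> latest-handshakes`, which prints one public key and one Unix timestamp per line (0 means no handshake has happened yet), and returns the keys whose handshake is too old. The interface name `wt0`, the 300-second threshold and the function names are assumptions for illustration. This only detects stuck peers; what to do with them is a separate question.

```python
import subprocess
import time


def parse_latest_handshakes(output):
    """Parse 'wg show <iface> latest-handshakes' output into {public_key: unix_ts}."""
    handshakes = {}
    for line in output.splitlines():
        fields = line.split()
        # Expected shape: "<public key>\t<unix timestamp>"
        if len(fields) == 2 and fields[1].isdigit():
            handshakes[fields[0]] = int(fields[1])
    return handshakes


def stuck_peers(handshakes, max_age=300, now=None):
    """Return the sorted public keys with no handshake or one older than max_age seconds."""
    now = time.time() if now is None else now
    return sorted(
        key for key, ts in handshakes.items()
        if ts == 0 or now - ts > max_age
    )


def read_handshakes(interface="wt0"):
    """Run wg and parse its output. 'wt0' is an assumed interface name."""
    result = subprocess.run(
        ["wg", "show", interface, "latest-handshakes"],
        check=True, capture_output=True, text=True,
    )
    return parse_latest_handshakes(result.stdout)
```

Calling `stuck_peers(read_handshakes())` every minute or so (as root, since `wg show` needs privileges) would list the public keys to act on.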
Any ideas welcome.
I found this, which is interesting, but it seems netbird already does the right thing:
https://www.reddit.com/r/WireGuard/comments/k3d1hc/latest_handshake_few_hours_ago/
Just to clarify the setup:
Netbird runs on multiple 5G routers (Teltonika TRB500) and on multiple servers (Windows). The connections are relayed due to CGNAT/firewall issues.
One of these servers records cameras that are reached through the routers.
Almost every night, some of the routers' relayed connections become stale and thus the cameras are unreachable. Simply restarting netbird fixes the issue.
From the other servers, the connections to the routers are not stale most of the time, but it also happens from time to time.
The problematic server is a VM hosted by a different provider, so the network issues may be mainly due to that provider, but my guess is that it has more to do with the failure of the WireGuard tunnel not being detected (e.g. the 5G router IP changed, 5G connection glitches, etc.).
Meh, I thought it was the WireGuard tunnel, but it seems to go deeper than that:
When peer is unreachable:
peer: 6kq3/G775aJK5slDq1OyEyLFK4TvyZiurx+OddRotVw=
endpoint: 127.1.189.16:51820
allowed ips: 100.70.189.16/32
transfer: 0 B received, 148 B sent
persistent keepalive: every 25 seconds
When peer is reachable:
peer: 6kq3/G775aJK5slDq1OyEyLFK4TvyZiurx+OddRotVw=
endpoint: 127.1.189.16:51820
allowed ips: 100.70.189.16/32
latest handshake: 28 seconds ago
transfer: 796.04 KiB received, 247.33 KiB sent
persistent keepalive: every 25 seconds
I removed/recreated the peer using plain wg set commands but it does not reconnect the peer.
The only thing working at this point is netbird down/up or editing the peer policy so netbird "resets" the config.
Should I give 0.46.0 a try?
I removed/recreated the peer using plain wg set commands but it does not reconnect the peer.
I'm pretty sure NetBird uses an elaborate negotiation process to establish connectivity. I wouldn't expect wg set to have any chance of working unless the peer was directly reachable over the internet.
You can always try 0.46.0, but after looking briefly at the release notes, I don't see anything particularly relevant there.
@nazarewk thanks.
I'm trying to find a workaround so that I only reset the stale peer instead of the whole netbird connection. Any idea? Removing and re-adding the WireGuard peer seemed smart, but I guess it's a dead end.
Hmm, forwarding UDP 51820 from the WAN to the peer does not seem to help establish a P2P connection. Any idea what to try?
I'll reopen this issue following the template and providing debug logs.