In the last weeks netbird randomly loses connection and is not able to recover

Open lfarkas opened this issue 10 months ago • 29 comments

Since v0.36.5 I am no longer able to connect to other peers. Sometimes a netbird restart solves the problem, sometimes not. netbird status -d shows the peers as connected, but not even a ping to a peer's 100.76.x.x IP address works. Here is a part of the log:

2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-1.relay.netbird.io:443] relay/client/client.go:214: open connection to peer: sha-Dn9xgXi3A/4FEe90jhVUP/dkvMcxA59y/e7x0g3oZO4=
2025-02-14T19:48:03+01:00 INFO client/iface/wgproxy/ebpf/proxy.go:102: turn conn added to wg proxy store: rels://streamline-de-fra1-1.relay.netbird.io:443, endpoint port: :3
2025-02-14T19:48:03+01:00 INFO [peer: +i/q6dNa3AeF/iNJMH9+CbnsTLmFPfN+/K0KUPJI5wI=] client/internal/peer/conn.go:447: created new wgProxy for relay connection: 127.0.0.1:3
2025-02-14T19:48:03+01:00 INFO [peer: +i/q6dNa3AeF/iNJMH9+CbnsTLmFPfN+/K0KUPJI5wI=] client/internal/peer/wg_watcher.go:87: WireGuard watcher started
2025-02-14T19:48:03+01:00 INFO [peer: f+tmDAAoOYRUT/WAoJl0PsqalR4zJvt7ljkxZboO9iE=] client/internal/peer/conn.go:476: start to communicate with peer via relay
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/client.go:164: create new relay connection: local peerID: gsrpCbJwc8lkmNV783rxIHpyj+zZIhy/rFj5HsfVuBY=, local peer hashedID: sha-99JRJjv0PJBbfBPJzmU0KgWX+n3VVc6ezC48fcixQBE=
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/client.go:170: connecting to relay server
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/dialer/race_dialer.go:64: dialing Relay server via quic
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/dialer/race_dialer.go:64: dialing Relay server via WS
2025-02-14T19:48:03+01:00 INFO [peer: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=] client/internal/peer/conn.go:476: start to communicate with peer via relay
2025-02-14T19:48:03+01:00 INFO client/internal/routemanager/client.go:210: New chosen route is co1co8bl0ubs739dfm90 with peer FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY= with score 19990.001000 for network [192.168.0.0/16]
2025-02-14T19:48:03+01:00 INFO [peer: +i/q6dNa3AeF/iNJMH9+CbnsTLmFPfN+/K0KUPJI5wI=] client/internal/peer/conn.go:476: start to communicate with peer via relay
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/dialer/race_dialer.go:89: successfully dialed via: WS
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/dialer/race_dialer.go:75: connection attempt aborted via: quic
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/client.go:186: relay connection established
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-0.relay.netbird.io:443] relay/client/client.go:214: open connection to peer: sha-d6bmxNpKji4X2AM4Syi/oXpY9FJ6J27RG3gTY9ONhdE=
2025-02-14T19:48:03+01:00 INFO client/iface/wgproxy/ebpf/proxy.go:102: turn conn added to wg proxy store: rels://streamline-de-fra1-0.relay.netbird.io:443, endpoint port: :4
2025-02-14T19:48:03+01:00 INFO [peer: hCDjKQBW9TBwsZigTRXxvVzpAYE+ZqDHBol4sOSUMl0=] client/internal/peer/conn.go:447: created new wgProxy for relay connection: 127.0.0.1:4
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-1.relay.netbird.io:443] relay/client/client.go:214: open connection to peer: sha-Asv8+qhh3HsYQgPXy3cIzGzTjlTvEIoTND3nPoVZDgw=
2025-02-14T19:48:03+01:00 INFO client/iface/wgproxy/ebpf/proxy.go:102: turn conn added to wg proxy store: rels://streamline-de-fra1-1.relay.netbird.io:443, endpoint port: :5
2025-02-14T19:48:03+01:00 INFO [peer: RtObgAe/KslyFa/t0a/iGwy7HohRzO8xhNNUPIR1ri8=] client/internal/peer/conn.go:447: created new wgProxy for relay connection: 127.0.0.1:5
2025-02-14T19:48:03+01:00 INFO [peer: hCDjKQBW9TBwsZigTRXxvVzpAYE+ZqDHBol4sOSUMl0=] client/internal/peer/wg_watcher.go:87: WireGuard watcher started
2025-02-14T19:48:03+01:00 INFO [peer: RtObgAe/KslyFa/t0a/iGwy7HohRzO8xhNNUPIR1ri8=] client/internal/peer/wg_watcher.go:87: WireGuard watcher started
2025-02-14T19:48:03+01:00 INFO [relay: rels://streamline-de-fra1-1.relay.netbird.io:443] relay/client/client.go:214: open connection to peer: sha-ULsX413ckuLILuPUeQ8liU9B86RCBgkvFP0SdhMWbUw=
2025-02-14T19:48:03+01:00 INFO client/iface/wgproxy/ebpf/proxy.go:102: turn conn added to wg proxy store: rels://streamline-de-fra1-1.relay.netbird.io:443, endpoint port: :6
2025-02-14T19:48:03+01:00 INFO [peer: 1u25Mrocd2aMv88fUgRnKmM1caynzX+bGTzThCZ3CnE=] client/internal/peer/conn.go:447: created new wgProxy for relay connection: 127.0.0.1:6
2025-02-14T19:48:03+01:00 INFO [peer: 1u25Mrocd2aMv88fUgRnKmM1caynzX+bGTzThCZ3CnE=] client/internal/peer/wg_watcher.go:87: WireGuard watcher started
2025-02-14T19:48:03+01:00 INFO [peer: RtObgAe/KslyFa/t0a/iGwy7HohRzO8xhNNUPIR1ri8=] client/internal/peer/conn.go:476: start to communicate with peer via relay
2025-02-14T19:48:03+01:00 INFO client/internal/routemanager/client.go:210: New chosen route is co1dqj3l0ubs739dfnsg with peer hCDjKQBW9TBwsZigTRXxvVzpAYE+ZqDHBol4sOSUMl0= with score 49990.001000 for network [192.168.0.0/16]
2025-02-14T19:48:03+01:00 INFO [peer: hCDjKQBW9TBwsZigTRXxvVzpAYE+ZqDHBol4sOSUMl0=] client/internal/peer/conn.go:476: start to communicate with peer via relay
2025-02-14T19:48:03+01:00 INFO [peer: 1u25Mrocd2aMv88fUgRnKmM1caynzX+bGTzThCZ3CnE=] client/internal/peer/conn.go:476: start to communicate with peer via relay
2025-02-14T19:48:03+01:00 INFO client/internal/routemanager/client.go:210: New chosen route is co1kv3bl0ubs739dg130 with peer 1u25Mrocd2aMv88fUgRnKmM1caynzX+bGTzThCZ3CnE= with score 0.001000 for network [10.20.0.0/24]
2025-02-14T19:48:03+01:00 INFO client/internal/routemanager/client.go:210: New chosen route is co1kuj3l0ubs739dg11g with peer 1u25Mrocd2aMv88fUgRnKmM1caynzX+bGTzThCZ3CnE= with score 0.001000 for network [10.30.0.0/24]
2025-02-14T19:48:03+01:00 INFO [peer: +i/q6dNa3AeF/iNJMH9+CbnsTLmFPfN+/K0KUPJI5wI=] client/internal/peer/conn.go:328: set ICE to active connection
2025-02-14T19:48:03+01:00 INFO [peer: +i/q6dNa3AeF/iNJMH9+CbnsTLmFPfN+/K0KUPJI5wI=] client/internal/peer/wg_watcher.go:111: WireGuard watcher stopped
2025-02-14T19:48:03+01:00 INFO [peer: f+tmDAAoOYRUT/WAoJl0PsqalR4zJvt7ljkxZboO9iE=] client/internal/peer/conn.go:328: set ICE to active connection
2025-02-14T19:48:03+01:00 INFO [peer: f+tmDAAoOYRUT/WAoJl0PsqalR4zJvt7ljkxZboO9iE=] client/internal/peer/wg_watcher.go:111: WireGuard watcher stopped
2025-02-14T19:48:03+01:00 INFO [peer: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=] client/internal/peer/conn.go:328: set ICE to active connection
2025-02-14T19:48:03+01:00 INFO [peer: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=] client/internal/peer/wg_watcher.go:111: WireGuard watcher stopped
2025-02-14T19:48:04+01:00 INFO [peer: RtObgAe/KslyFa/t0a/iGwy7HohRzO8xhNNUPIR1ri8=] client/internal/peer/conn.go:328: set ICE to active connection
2025-02-14T19:48:04+01:00 INFO [peer: RtObgAe/KslyFa/t0a/iGwy7HohRzO8xhNNUPIR1ri8=] client/internal/peer/wg_watcher.go:111: WireGuard watcher stopped
2025-02-14T19:48:04+01:00 INFO [peer: 1u25Mrocd2aMv88fUgRnKmM1caynzX+bGTzThCZ3CnE=] client/internal/peer/conn.go:328: set ICE to active connection
2025-02-14T19:48:04+01:00 INFO [peer: 1u25Mrocd2aMv88fUgRnKmM1caynzX+bGTzThCZ3CnE=] client/internal/peer/wg_watcher.go:111: WireGuard watcher stopped
2025-02-14T19:48:05+01:00 INFO [peer: hCDjKQBW9TBwsZigTRXxvVzpAYE+ZqDHBol4sOSUMl0=] client/internal/peer/conn.go:328: set ICE to active connection
2025-02-14T19:48:05+01:00 INFO [peer: hCDjKQBW9TBwsZigTRXxvVzpAYE+ZqDHBol4sOSUMl0=] client/internal/peer/wg_watcher.go:111: WireGuard watcher stopped
2025-02-14T19:48:05+01:00 INFO [peer: f+tmDAAoOYRUT/WAoJl0PsqalR4zJvt7ljkxZboO9iE=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:48:05+01:00 INFO [peer: +i/q6dNa3AeF/iNJMH9+CbnsTLmFPfN+/K0KUPJI5wI=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:48:05+01:00 INFO [peer: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:48:06+01:00 INFO [peer: Yg/JDeFsAfMnue9KOTNm77L0AlG1g3Y6pYIm3KhUxyw=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:48:06+01:00 INFO [peer: 1u25Mrocd2aMv88fUgRnKmM1caynzX+bGTzThCZ3CnE=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:48:06+01:00 INFO [peer: RtObgAe/KslyFa/t0a/iGwy7HohRzO8xhNNUPIR1ri8=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:48:06+01:00 INFO [peer: Kc8hGcw4uOpvTwgvTste9cdhtPpmMLsZDeOYSITNGnk=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:53:02+01:00 INFO client/internal/peer/guard/sr_watcher.go:94: network changes detected by ICE agent
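
To make an excerpt like the one above easier to read, here is a minimal Python sketch that groups the log lines by peer and lists peers for which "start reconnect loop..." is logged after "set ICE to active connection". The line layout is inferred only from this excerpt, the function names are hypothetical, and whether that ordering actually indicates a fault is not established in this thread.

```python
import re
from collections import defaultdict

# Layout assumed from the excerpt above:
#   <timestamp> <LEVEL> [peer: <key>] <file.go:line>: <message>
# The "[peer: ...]" / "[relay: ...]" tag is optional.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"(?:\[(?P<kind>peer|relay):\s*(?P<id>[^\]]+)\]\s+)?"
    r"(?P<src>\S+?:\d+):\s+(?P<msg>.*)$"
)

ICE_ACTIVE = "set ICE to active connection"
RECONNECT = "start reconnect loop"


def events_by_peer(lines):
    """Return {peer public key: [(timestamp, message), ...]} for peer-tagged lines."""
    events = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("kind") == "peer":
            events[m.group("id")].append((m.group("ts"), m.group("msg")))
    return dict(events)


def reconnecting_after_ice(events):
    """List peers whose log shows a reconnect loop after ICE became active."""
    flagged = []
    for peer, items in events.items():
        msgs = [msg for _, msg in items]
        if ICE_ACTIVE in msgs:
            later = msgs[msgs.index(ICE_ACTIVE) + 1:]
            if any(msg.startswith(RECONNECT) for msg in later):
                flagged.append(peer)
    return flagged


# Lines copied from the log excerpt in this issue.
SAMPLE = """\
2025-02-14T19:48:03+01:00 INFO [peer: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=] client/internal/peer/conn.go:476: start to communicate with peer via relay
2025-02-14T19:48:03+01:00 INFO [peer: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=] client/internal/peer/conn.go:328: set ICE to active connection
2025-02-14T19:48:03+01:00 INFO [peer: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=] client/internal/peer/wg_watcher.go:111: WireGuard watcher stopped
2025-02-14T19:48:05+01:00 INFO [peer: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:48:06+01:00 INFO [peer: Yg/JDeFsAfMnue9KOTNm77L0AlG1g3Y6pYIm3KhUxyw=] client/internal/peer/guard/guard.go:84: start reconnect loop...
2025-02-14T19:53:02+01:00 INFO client/internal/peer/guard/sr_watcher.go:94: network changes detected by ICE agent
"""

if __name__ == "__main__":
    for peer in reconnecting_after_ice(events_by_peer(SAMPLE.splitlines())):
        print(peer)
```

Run against a full client log, this only narrows down which peers to look at first; it does not replace the debug bundle.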

lfarkas avatar Feb 14 '25 19:02 lfarkas

The peer is connected but can't be pinged:

# ping fox
PING fox.netbird.cloud (100.76.171.201) 56(84) bytes of data.
^C
--- fox.netbird.cloud ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3109ms

from netbird status -d:

 fox.netbird.cloud:
  NetBird IP: 100.76.171.201
  Public key: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/srflx
  ICE candidate endpoints (Local/Remote): 10.5.5.217:51820/185.199.30.141:14255
  Relay server address: rels://streamline-de-fra1-2.relay.netbird.io:443
  Last connection update: 18 minutes, 19 seconds ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 12.1 KiB/20.3 KiB
  Quantum resistance: true
  Routes: -
  Networks: -
  Latency: 8.003627ms

OS: linux/amd64
Daemon version: 0.36.7
CLI version: 0.36.7
Management: Connected to https://api.netbird.io:443
Signal: Connected to https://signal.netbird.io:443
Relays: 
  [stun:stun.netbird.io:5555] is Available
  [turns:turn.netbird.io:443?transport=tcp] is Available
  [rels://streamline-de-fra1-2.relay.netbird.io:443] is Available
Nameservers: 
  [192.168.208.1:53] for [int.vidux.hu] is Unavailable, reason: 1 error occurred:
	* read udp 10.5.5.217:50996->192.168.208.1:53: i/o timeout
  [10.30.0.1:53] for [szeged.vidux.hu] is Available
FQDN: dell.netbird.cloud
NetBird IP: 100.76.111.32/16
Interface type: Kernel
Quantum resistance: true (permissive)
Routes: -
Networks: -
Peers count: 5/8 Connected
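
The combination reported above (Status: Connected, but Last WireGuard handshake: -) can be checked for all peers at once. The following Python sketch parses text in the layout shown in this comment and lists such peers. The parsing rules (an indented header line ending in a colon, followed by indented "key: value" lines) are assumptions drawn only from this output and may not hold for other NetBird versions; the function names are hypothetical.

```python
def parse_peers(status_text):
    """Parse peer blocks from `netbird status -d` style text.

    Assumed layout (taken from the output in this issue): a peer block
    starts with an indented line such as " fox.netbird.cloud:" and is
    followed by indented "key: value" lines. Non-indented lines belong
    to the overall summary and end the current peer block.
    """
    peers = {}
    current = None
    for raw in status_text.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():
            current = None  # summary line such as "OS: linux/amd64"
            continue
        line = raw.strip()
        if line.endswith(":") and ": " not in line:
            current = peers.setdefault(line[:-1], {})
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value.strip()
    return peers


def connected_without_handshake(peers):
    """Peers reported as Connected whose WireGuard handshake field is '-'."""
    return [
        name
        for name, fields in peers.items()
        if fields.get("Status") == "Connected"
        and fields.get("Last WireGuard handshake") == "-"
    ]


# Lines copied from the status output in this comment.
SAMPLE = "\n".join([
    " fox.netbird.cloud:",
    "  NetBird IP: 100.76.171.201",
    "  Public key: FfiyZKMquYILabBxOquw/jXEuTjhBq6tUvBEPdV3ckY=",
    "  Status: Connected",
    "  -- detail --",
    "  Connection type: P2P",
    "  Last connection update: 18 minutes, 19 seconds ago",
    "  Last WireGuard handshake: -",
    "  Latency: 8.003627ms",
    "",
    "OS: linux/amd64",
    "NetBird IP: 100.76.111.32/16",
])

if __name__ == "__main__":
    print(connected_without_handshake(parse_peers(SAMPLE)))
```

Feeding it the full status text (for example, saved to a file) would give a quick list of peers worth pinging individually.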

lfarkas avatar Feb 22 '25 08:02 lfarkas

@lfarkas, can you please run the following command while repeating the ping test?

netbird debug for 5m -S

Then please share the generated bundle file.

mlsmaycon avatar Feb 22 '25 09:02 mlsmaycon

To be honest, this is a serious problem for us. In the last few months it has regularly happened that we can't access the work network from home, and someone must restart the NetBird service inside the internal network... Sometimes the connection doesn't work even then.

lfarkas avatar Feb 22 '25 11:02 lfarkas

Hey just wanted to chime in that I'm having the same issue when deploying via Kubernetes, really keen on a fix for this.

nickz-LR avatar Feb 26 '25 22:02 nickz-LR

I'm experiencing the same issue. I deployed a self-hosted instance using the Helm chart from totmicro/helms. There might be a problem with the relay server configuration, as my peers seem to disconnect after a while (they appear to lose connection with the relay server):

Relays: 
  [rels://vpn.my.domain:443/relay] is Unavailable, reason: relay connection is not established

On the relay server I get a lot of the following errors:

ERRO relay/server/relay.go:121: failed to handshake: validate sha-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (x.x.x.x:yyyy): expired token

Kamaradeivanov avatar Mar 13 '25 08:03 Kamaradeivanov

We don't use a self-hosted version but the cloud version, and it still happens and is very annoying. When I can't access the remote site, there is no other way to restart netbird than to go to the site and reboot the machine or restart the netbird service.

Is there any progress on this?

lfarkas avatar Mar 14 '25 18:03 lfarkas

Try this: disable and re-enable the policy. Please share the result here.

ugurtam avatar Mar 14 '25 18:03 ugurtam

Which policy, and where and how? Anyway, I sent you the whole debug log above 3 weeks ago. Did you look into that?

lfarkas avatar Mar 14 '25 18:03 lfarkas

I've got 7 connected peers; of those I can ping 5 and can't ping 2. There is only one policy in https://app.netbird.io/access-control, the default. If I disable and enable it, there are still 2 I can't ping, but not the same 2 :-)

lfarkas avatar Mar 14 '25 18:03 lfarkas

And today I updated all clients to 0.38.0.

lfarkas avatar Mar 14 '25 18:03 lfarkas

After playing a bit with disabling/enabling the policy, I am sometimes able to access this critical peer (which is always online in the peer list) for a few minutes or seconds, but never for longer than 5 minutes, and after that it no longer works. After another disable/enable it works again for a few minutes, but with this click I disconnect my whole netbird network...

lfarkas avatar Mar 14 '25 19:03 lfarkas

@lfarkas, we will prepare a debugging version for you to try tomorrow, as it seems like the fixes from recent versions are not helping your case.

mlsmaycon avatar Mar 14 '25 19:03 mlsmaycon

x86_64 rpm please

lfarkas avatar Mar 14 '25 20:03 lfarkas

@lfarkas, you can download the packages from the link:

https://github.com/netbirdio/netbird/actions/runs/13881767994/artifacts/2759669965

this file will have builder artifacts for the PR: https://github.com/netbirdio/netbird/pull/3517. You will find the rpm installer there, too.

In case of an issue, please make sure that the agent is running for at least 10 minutes, then generate a bundle with logs for analysis with the command:

netbird debug bundle -S

Also, please share which peers the node can't connect to.

mlsmaycon avatar Mar 16 '25 09:03 mlsmaycon

So I have to install it on one client and not all? And the other clients can stay on the normal 0.38 version?

lfarkas avatar Mar 16 '25 09:03 lfarkas

If you can install on all affected clients, that will increase our chances of getting helpful logs

mlsmaycon avatar Mar 16 '25 10:03 mlsmaycon

@lfarkas, the last build had the potential to cause a panic. You can use this one instead: https://github.com/netbirdio/netbird/actions/runs/13884417374/artifacts/2760236240

mlsmaycon avatar Mar 16 '25 15:03 mlsmaycon

Both of these contain the same commit ID: netbird_0.38.1-SNAPSHOT-9c4fdec9_linux_amd64.rpm. Anyway, before I installed it I couldn't ping 100.76.121.209 (whose status is connected). After I installed it, ping started to work, but after about 5 minutes it stopped working again.

After this I stopped the normal systemd service, and while I was looking up which command to start, ping began working in the other window, and it turned out something had started the netbird service!? I stopped it again with systemctl stop netbird.service, and about a minute later ping worked again and netbird was running again!? Why can't I stop it? Even after systemctl disable --now netbird.service it still starts itself within about a minute. Is there any way I can stop it?

Anyway, if I'm fast enough:

root@wolf:~# systemctl stop netbird.service ;netbird debug bundle -S
Job for netbird.service canceled.
/tmp/netbird.debug.1526428740.zip

I hope I can run it in test mode. My local NetBird IP is: 100.76.24.179

the remote client's: NetBird IP: 100.76.121.209 Public key: f+tmDAAoOYRUT/WAoJl0PsqalR4zJvt7ljkxZboO9iE=

And of course when I start it in this mode ping works, but after 183 packets it stops working again. Here is the debug output. (I installed this rpm only on my local client; if you need it on the remote client too, let me know.)

lfarkas avatar Mar 16 '25 16:03 lfarkas

But I don't know whether it's valid output or not, since this command returns immediately:

root@wolf:~# systemctl stop netbird.service ;netbird debug bundle -S
Job for netbird.service canceled.
/tmp/netbird.debug.1526428740.zip

lfarkas avatar Mar 16 '25 17:03 lfarkas

@lfarkas, sorry, I didn't get why you tried to stop the agent. The agent should be running and failing when you generate the bundle.

mlsmaycon avatar Mar 16 '25 17:03 mlsmaycon

OK, but the agent is ALWAYS running since it's not possible to turn it off. IMHO that's a problem.

Here is another dump (taken while ping is not working; I'm sure that if I restart the service it will work again for a few minutes): netbird.debug.3319656448.zip

Is there anything I can do?

lfarkas avatar Mar 16 '25 18:03 lfarkas

@lfarkas Szia! Can we schedule a call to go through some details?

pappz avatar Mar 16 '25 22:03 pappz

@lfarkas can you confirm whether the issue persists with the latest version and Rosenpass?

mlsmaycon avatar Apr 17 '25 12:04 mlsmaycon

To be honest, I'd rather not test it, at least not before Easter. If I reconfigure my VPN settings to Rosenpass and it still doesn't work, I'll no longer be able to access my office network (which has happened before), and there is no way to recover from that state... Maybe after Easter...

lfarkas avatar Apr 17 '25 13:04 lfarkas

Hi there,

I am having exactly the same issue. After a re-install it works for a couple of minutes and afterwards it stops working. After enabling/disabling the policy I get this message:

client/internal/peer/handshaker.go:79: wait for remote offer confirmation on both servers.

I am running the current version 0.43.1 on Debian.

baldy2811 avatar May 01 '25 22:05 baldy2811

Hi,

please forget what i said.

Chain fail2ban-SIP (1 references)
target     prot opt source               destination
REJECT     all  --  100.114.165.225      anywhere             reject-with icmp-port-unreachable
REJECT     all  --  100.114.188.68       anywhere             reject-with icmp-port-unreachable
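
The chain listing above suggests that fail2ban had banned two 100.114.x.x addresses. As a rough way to spot this kind of conflict, the Python sketch below scans an iptables-style listing for REJECT or DROP rules whose source falls inside 100.64.0.0/10. Treating that range (the shared address space that 100.76.x.x and 100.114.x.x both belong to) as the overlay network is an assumption made for illustration, and the function name is hypothetical.

```python
import ipaddress

# Assumption: overlay peer addresses come from 100.64.0.0/10. Adjust this
# to the network actually assigned to your account or self-hosted setup.
OVERLAY_NET = ipaddress.ip_network("100.64.0.0/10")


def blocked_overlay_sources(chain_listing, overlay=OVERLAY_NET):
    """Return source IPs with a REJECT/DROP rule that lie inside `overlay`.

    Expects lines in the column order shown in the listing above:
    target, prot, opt, source, destination, ...
    """
    blocked = []
    for line in chain_listing.splitlines():
        cols = line.split()
        if len(cols) < 5 or cols[0] not in ("REJECT", "DROP"):
            continue
        try:
            src = ipaddress.ip_address(cols[3])
        except ValueError:
            continue  # "anywhere", hostnames and CIDR ranges are skipped here
        if src in overlay:
            blocked.append(str(src))
    return blocked


# Listing copied from the comment above.
SAMPLE = """\
Chain fail2ban-SIP (1 references)
target     prot opt source               destination
REJECT     all  --  100.114.165.225      anywhere             reject-with icmp-port-unreachable
REJECT     all  --  100.114.188.68       anywhere             reject-with icmp-port-unreachable
"""

if __name__ == "__main__":
    for ip in blocked_overlay_sources(SAMPLE):
        print("blocked overlay address:", ip)
```

If this reports addresses of your own peers, a local firewall or fail2ban jail is a more likely cause of the failed pings than NetBird itself.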

baldy2811 avatar May 01 '25 23:05 baldy2811

Hi all, I just stumbled across this issue and wondered if I would be able to help out future people as I also had similar issues. I wrote about this in my comment on #3852.

We were having random disconnects after we enabled Rosenpass, so to test this theory I disabled Rosenpass across all peers and set a pre-shared key instead. The random disconnects completely stopped after this.

Therefore I would suggest disabling quantum resistance in the hope that doing so will enable your peers to remain connected.

Markovich01 avatar Jun 03 '25 22:06 Markovich01