netbird icon indicating copy to clipboard operation
netbird copied to clipboard

Latency base routing does not work

Open hurricanehrndz opened this issue 1 year ago • 6 comments

Describe the problem

latency base routing was suppose to be added with the following patch: https://github.com/netbirdio/netbird/pull/1732

Unfortunately I believe the following if statement is wrong, but I am hesitant to open a PR to fix it since it will make the underlying wireguard connection unstable:

https://github.com/netbirdio/netbird/pull/1732/files#diff-67387c2b097eeb87387ab2575884677a2f9ee1c53b9e8674dcfa4936463c23f9R155

There also seems to be a few race conditions that could impact the route analyzes poorly.

In my testing sometimes the slower connection is picked because it was the first to connect. Other times the slower connection is picked because latency for one of the later connected peers is missing.

I will try and add logs, as I collect them.

To Reproduce

Steps to reproduce the behavior: Fix if statement, set a more reasonable latency forgiveness number such as 0.015 (15 ms) rather than 100ms and watch what happens. You will eventually end up disconnected from all peers

Expected behavior

I expected the best route to be picked, unless it is a route that is only 100ms different as per the the latency base routing patch intended

Are you using NetBird Cloud?

Nope

NetBird version

modified version of 0.27.3

NetBird status -d output:

Peers detail:
 lon1.peers.n.com:
  NetBird IP: 100.113.20.218
  Public key: DdiC7cXVeE409wdJO1bQjFQ8JYP52X+/HO46FtkTQSI=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 2024-04-29 10:29:36
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 135.14475ms

 sfo2.peers.n.com:
  NetBird IP: 100.113.35.69
  Public key: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 2024-04-29 10:29:38
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 59.2695ms

 iad1.peers.n.com:
  NetBird IP: 100.113.74.116
  Public key: MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 2024-04-29 10:29:39
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 69.943916ms

Daemon version: 0.27.3.1009
CLI version: 0.27.3.1009
Management: Connected to https://netbird.n.com:443
Signal: Connected to https://netbird.n.com:443
Relays:
  [stun:stunvpn-ec2.n.com:3478] is Available
  [turn:stunvpn-ec2.n.com:3478?transport=udp] is Unavailable, reason: allocate: all retransmissions failed for fGXwf/+OxEZN791L
Nameservers:
  [10.29.2.45:53, 10.29.2.46:53, 10.29.2.47:53] for [n.com, corp.n.com, dev.n.com] is Available
  [10.29.2.45:53, 10.29.2.46:53, 10.29.2.47:53] for [.] is Available
FQDN: hx7yg952h5-89.peers.n.com
NetBird IP: 100.113.101.234/16
Interface type: Userspace
Quantum resistance: false
Routes: -
Peers count: 0/3 Connected

Additional context

I have explicitly disable turn on my coturn instance to force direct connections

hurricanehrndz avatar Apr 29 '24 16:04 hurricanehrndz

HI @hurricanehrndz, I think you are right. The condition will keep the worse route. I updated the tests and will create a fix soon. As for the 100ms between the routes, I reduced that to 10ms. The idea was to avoid frequently alternating routes if their latency is similar. 10ms should be enough.

pascal-fischer avatar Apr 29 '24 17:04 pascal-fischer

There is a race condition if a peer just connected the latency might still be 0 if the route selection happens before the first update. However, this should not be an issue as we will recalculate the routes once we receive a new latency. In the future we might need to reduce the frequency at which the latency is updated as in bigger networks this could cause permanent recalculation and therefore wasted resources.

If you see other race conditions please let me know.

pascal-fischer avatar Apr 29 '24 17:04 pascal-fischer

@pascal-fischer thanks for the quick response, with latency base routing fixed eventually this occurs, I added way more logging to try and track the issue. Are relay connections necessary?

2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:120: peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8= has 63.008083ms latency
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:145: analyzed network: STAGE_IPS - 172.19.36.0/22, on peer: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, current score: 0, chosen score: 99002.936991917, chosen peer: co6ojai9l7bha9qbpgdg:co6mgqa9l7bha9qbpf70
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:120: peer MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI= has 73.212334ms latency
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:145: analyzed network: STAGE_IPS - 172.19.36.0/22, on peer: MGJ+hGFni8uJSgVsF5AZI9GJCSEaOfJOUJ9joDxbhBI=, current score: 99002.926787666, chosen score: 99002.936991917, chosen peer: co6ojai9l7bha9qbpgdg:co6mgqa9l7bha9qbpf70
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:156: Current route is not most optimal, comparing route scores: current: 99002.931787666 vs new: 99002.936991917
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:184: triggered route state update for Peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, state: Connected
2024-04-29T11:45:25-06:00 DEBG client/internal/routemanager/client.go:184: triggered route state update for Peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, state: Connected
2024-04-29T11:45:25-06:00 INFO client/internal/routemanager/client.go:162: migrating route to new routing peer: S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8=, for network: 172.19.36.0/22, with score: 99002.936991917

2024-04-29T11:45:25-06:00 WARN client/internal/peer/conn.go:513: got an error while sending the punch packet, err: write udp4 0.0.0.0:43197->pulbic_ip:51820: use of closed network connection
2024-04-29T11:45:26-06:00 WARN client/internal/peer/conn.go:513: got an error while sending the punch packet, err: write udp4 0.0.0.0:43197->public_ip:51820: use of closed network connection
2024-04-29T11:45:30-06:00 DEBG client/internal/routemanager/routemanager.go:90: Decreasing route ref count 4 for prefix coturn_ip/32
2024-04-29T11:45:30-06:00 DEBG client/internal/relay/relay.go:89: turn probe error from turn:stunvpn-ec2.n.com:3478?transport=udp: allocate: all retransmissions failed for y/z+d12+C8ncq0G5
2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1298: received relay probe request, healthy: false
2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1305: received wg probe request
2024-04-29T11:45:30-06:00 DEBG client/internal/engine.go:1311: failed to get wg stats for peer DdiC7cXVeE409wdJO1bQjFQ8JYP52X+/HO46FtkTQSI=: find peer info: config key not found: last_handshake_time_sec
2024-04-29T11:45:31-06:00 DEBG client/internal/peer/conn.go:662: peer S3TburlUFqMPX3PLDlKhQnD/NLxq9B9YCzJ7mAZozh8= ICE ConnectionState has changed to Disconnected

This may be unrelated to latency base routing, but it latency base routing definitely helps to surface whatever the bug is

hurricanehrndz avatar Apr 29 '24 18:04 hurricanehrndz

Hello @hurricanehrndz can you share the full masked logs and the output from netbird status -d when all routing peers are connected?

mlsmaycon avatar Apr 29 '24 18:04 mlsmaycon

I can try to catch it, it does disconnect quickly

hurricanehrndz avatar Apr 29 '24 19:04 hurricanehrndz

Ok, in that case, can you share logs after 2 reconnections?

mlsmaycon avatar Apr 29 '24 19:04 mlsmaycon

@mlsmaycon maybe we should close this issue and I will open a new one specifically scoped ICE candidates going to a disconnect state without recovery

hurricanehrndz avatar Apr 30 '24 20:04 hurricanehrndz

Ok, thanks @hurricanehrndz

mlsmaycon avatar May 02 '24 07:05 mlsmaycon

@mlsmaycon I found the other bugs that stop this from working on userspace wireguard interfaces, I will open PRs that should be merged before this does

hurricanehrndz avatar May 02 '24 09:05 hurricanehrndz

Thanks, @hurricanehrndz;

Please let me know if you have any questions about the bugs you've found before opening the PRs.

BTW, we just merged #1903.

mlsmaycon avatar May 02 '24 09:05 mlsmaycon

Awesome not a big deal, new release should be avoided till fixes are merged. I did document one of the bugs in the self-hosted channel.

hurricanehrndz avatar May 02 '24 10:05 hurricanehrndz

Let me open a new issue

hurricanehrndz avatar May 02 '24 10:05 hurricanehrndz