
[Bug]: Relayed nodes cannot reach peers behind second relay

Open · kellervater opened this issue · 3 comments

Contact Details

[email protected]

What happened?

This bug has been discussed in the Discord channel Netmaker Support - #general with @afeiszli.

Here's the summary:

I've finally got a Netmaker network up and running, but I'm facing some strange behavior in connectivity between peers.

My setup is the following:

  • 2 bare-metal servers in different data centers running Proxmox VE; both servers run netclient.
  • Each server hosts 3 VMs which need to be able to talk to each other to form an HA k3s cluster. The VMs sit behind NAT in the 192.168.0.0 subnet.
  • 1 Netmaker server node running publicly on EC2.

With a basic mesh setup, only the 2 bare-metal host servers can reach every peer; the VMs can only reach other peers within the same data center. (Yes, I activated UDP hole punching.)
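A quick way to see which peers actually complete a WireGuard handshake (rather than just appearing in the peer list) is to check the handshake timestamps. A minimal sketch — the interface name `nm-netmaker` is an assumption; check `wg show interfaces` on your host:

```python
import subprocess
import time

IFACE = "nm-netmaker"  # hypothetical name; substitute your netclient interface

def recent_handshakes(iface: str, max_age: int = 180) -> dict:
    """Map each peer's public key to whether it completed a recent handshake."""
    out = subprocess.check_output(
        ["wg", "show", iface, "latest-handshakes"], text=True
    )
    now = time.time()
    peers = {}
    for line in out.strip().splitlines():
        pubkey, ts = line.split()
        # a timestamp of 0 means the peer has never completed a handshake
        peers[pubkey] = int(ts) != 0 and now - int(ts) < max_age
    return peers

if __name__ == "__main__":
    for pubkey, alive in recent_handshakes(IFACE).items():
        print(f"{pubkey[:16]}...  {'handshake OK' if alive else 'NO handshake'}")
```

Peers that show "NO handshake" never established a direct connection, which matches the hole-punching failure I'm seeing between data centers.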

So I thought I'd make each host server a relay node for the VMs running on it. This ended up looking like the first picture.

Thing is... now only the 2 relay servers were able to reach each other, and that's it.

Now the fun part... if I make only one of the 2 servers a relay, as in the second picture, everything works like a charm.

And it doesn't matter whether aio1 or aio2 is the relay for its VMs. It's kinda mutually exclusive: both as relays doesn't work, none as relay doesn't work, but a single one does. It's a bit odd.
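To see what each relay change actually does to routing, it can help to dump the AllowedIPs table per peer on each host: if a relayed node's /32 shows up under two different peers, or under none, packets for it will go astray. A minimal sketch, again assuming a hypothetical interface name `nm-netmaker`:

```python
import subprocess

IFACE = "nm-netmaker"  # hypothetical; substitute your netclient interface

def allowed_ips(iface: str) -> dict:
    """Map each peer's public key to the AllowedIPs routed through it."""
    out = subprocess.check_output(
        ["wg", "show", iface, "allowed-ips"], text=True
    )
    table = {}
    for line in out.strip().splitlines():
        pubkey, _, ips = line.partition("\t")
        table[pubkey] = ips.split()
    return table

if __name__ == "__main__":
    for pubkey, ips in allowed_ips(IFACE).items():
        print(pubkey[:16], "...", ", ".join(ips))
```

Comparing this output with one relay vs. two would show whether the second relay's routes overwrite or conflict with the first one's.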

Has any of you faced something similar, or am I missing something obvious here?

Private (Netmaker) addresses of each host:

aio1: 10.236.196.7

  • master1: 10.236.196.4
  • worker1: 10.236.196.5
  • worker2: 10.236.196.6

aio2: 10.236.196.8

  • master2: 10.236.196.1
  • worker3: 10.236.196.2
  • worker4: 10.236.196.3
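To make the reachability pattern easy to reproduce, here's a small sketch that pings each of the addresses above from the current host (plain ICMP via the Linux `ping` binary; the host map just mirrors the list above):

```python
import subprocess

# Netmaker addresses from the setup above
HOSTS = {
    "aio1": "10.236.196.7",
    "master1": "10.236.196.4",
    "worker1": "10.236.196.5",
    "worker2": "10.236.196.6",
    "aio2": "10.236.196.8",
    "master2": "10.236.196.1",
    "worker3": "10.236.196.2",
    "worker4": "10.236.196.3",
}

def reachable(addr: str) -> bool:
    """Send one ICMP echo with a 2-second deadline (Linux ping flags)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0

if __name__ == "__main__":
    for name, addr in HOSTS.items():
        status = "reachable" if reachable(addr) else "UNREACHABLE"
        print(f"{name:8} {addr:15} {status}")
```

Run on each node in turn, this gives the full reachability matrix described below.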

The 2-relay setup allows connectivity between the nodes behind each relay, but no interconnectivity between the 2 servers, except for the relay servers themselves: aio1 can reach aio2 and vice versa.

No showstopper for me atm, as long as I don't add a 3rd data center with NATed VMs, I assume.

If I can provide further input, please let me know!

Besides this: thanks for this great product and your comprehensive documentation. This was EXACTLY what I've been looking for for weeks now.

Version

v0.11.1

What OS are you using?

Linux

Relevant log output

No response

Contributing guidelines

  • [X] Yes, I did.

kellervater avatar Mar 15 '22 16:03 kellervater

Investigating. In the meantime, if you can test with 0.12.1 to confirm it is still an issue, that would be appreciated.

afeiszli avatar Mar 25 '22 12:03 afeiszli

Yes, with 0.12.1 the issue is still there. The network seems to recover quite quickly after removing the 2nd relay; at least my k8s cluster started working again after a while.

kellervater avatar Mar 29 '22 08:03 kellervater

FYI: just tried 0.12.2 and the issue persists. Dig the dark mode though 👍

One strange finding here: if I undo the 2nd relay, the network doesn't recover. Every remaining relayed node cannot reach anything anymore; I need to recreate the entire network with a single relay for it to be functional again.

kellervater avatar Apr 01 '22 11:04 kellervater

Just tried this with v0.16.0 and the issue has been resolved.

In this scenario, node relayed is relayed by node relay, and node lxc is relayed by node node1 ... lxc is also an egress gateway with a gateway range of 10.0.3.0/24. relayed can ping lxc and hosts in the egress range, and lxc can ping relayed.

Note: it does take some time after creating a relay before pings go through (on the order of 30-60 secs).
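Given that delay, a retry loop avoids declaring a relay broken too early. A minimal sketch:

```python
import subprocess
import time

def wait_until_reachable(addr: str, timeout: int = 120, interval: int = 5) -> bool:
    """Ping addr every `interval` seconds until it answers or `timeout` elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ok = subprocess.run(
            ["ping", "-c", "1", "-W", "2", addr],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode == 0
        if ok:
            return True
        time.sleep(interval)
    return False

# e.g. wait_until_reachable("10.0.3.1") for a host in the egress range
```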

mattkasun avatar Sep 26 '22 15:09 mattkasun