Question: NAT Setup
I seem to be missing something important. If I set up a mesh of hosts that all have direct public IP addresses, it works fine. However, if I have a network with a lighthouse (public IP) and all other nodes behind NAT, the nodes will not connect to each other. The lighthouse is able to communicate with all hosts, but hosts are not able to communicate with each other.
Watching the logs, I see connection attempts to both the NAT public IP and the private IPs.
I have enabled punchy and punch back, but it does not seem to help.
Hopefully it is something simple?
Also, to note: in this setup all nodes are behind different NATs on different networks. It is hub and spoke, with the hub being the lighthouse and the spokes going to hosts on different networks.
My best guess (because I just messed this up in a live demo) is that am_lighthouse may be set to "true" on the individual nodes.
Either way, can you post your lighthouse config and one of your node configs?
(feel free to replace any sensitive IP/config bits, just put consistent placeholders in their place)
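For reference, a minimal sketch of how that flag should be split (only the lighthouse itself carries am_lighthouse: true):

# on the lighthouse
lighthouse:
  am_lighthouse: true

# on every non-lighthouse node
lighthouse:
  am_lighthouse: false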
Hi, I have the same issue. My lighthouse is on a DigitalOcean droplet with a public IP. My MacBook and Linux laptop at home are on the same network, both connected to the lighthouse. I can ping the lighthouse from both laptops, but I cannot ping from one laptop to the other.
Lighthouse config
pki:
  ca: /data/cert/nebula/ca.crt
  cert: /data/cert/nebula/lighthouse.crt
  key: /data/cert/nebula/lighthouse.key

static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]

lighthouse:
  am_lighthouse: true
  interval: 60
  hosts:

listen:
  host: 0.0.0.0
  port: 4242

punchy: true

tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

    - port: 443
      proto: tcp
      groups:
        - laptop
MacBook config
pki:
  ca: /Volumes/code/cert/nebula/ca.crt
  cert: /Volumes/code/cert/nebula/mba.crt
  key: /Volumes/code/cert/nebula/mba.key

static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "LIGHTHOUSE_PUBLIC_IP"

punchy: true

tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300

logging:
  level: debug
  format: text

firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

    - port: 443
      proto: tcp
      groups:
        - laptop
Linux laptop config
pki:
  ca: /data/cert/nebula/ca.crt
  cert: /data/cert/nebula/server.crt
  key: /data/cert/nebula/server.key

static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "LIGHTHOUSE_PUBLIC_IP"

punchy: true

listen:
  host: 0.0.0.0
  port: 4242

tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

    - port: 443
      proto: tcp
      groups:
        - laptop
@nfam thanks for sharing the config. My next best guess is that NAT isn't reflecting and for some reason the nodes also aren't finding each other locally.
Try setting the local_range config setting on the two laptops, which can give them a hint about the local network range to use for establishing the direct tunnel.
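That is a top-level key in each laptop's config; a sketch, assuming the shared home LAN is 192.168.1.0/24 (substitute your actual subnet):

# hint to nebula about the local LAN so peers on it can be reached directly
local_range: "192.168.1.0/24"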
@nfam similar setup. Public lighthouse on DigitalOcean, laptop on home NAT, and server in AWS behind a NAT. Local and AWS are using different private ranges (though overlap should be handled).
@rawdigits setting local_range does not help.
I stopped nebula on both laptops, set the log level on the lighthouse to debug, cleared the log, and restarted the lighthouse (with no node connected to it). Following is the log I got.
time="2019-11-23T20:05:18Z" level=info msg="Main HostMap created" network=192.168.100.1/24 preferredRanges="[]"
time="2019-11-23T20:05:18Z" level=info msg="UDP hole punching enabled"
time="2019-11-23T20:05:18Z" level=info msg="Nebula interface is active" build=1.0.0 interface=neb0 network=192.168.100.1/24
time="2019-11-23T20:05:18Z" level=debug msg="Error while validating outbound packet: packet is not ipv4, type: 6" packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 183 226 137 252 10 196 21 15 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 27 133 0 0 0 0]"
My Config:
nebula-cert sign -name "lighthouse" -ip "192.168.100.1/24"
nebula-cert sign -name "laptop" -ip "192.168.100.101/24" -groups "laptop"
nebula-cert sign -name "server" -ip "192.168.100.201/24" -groups "server"
Lighthouse:
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/lighthouse.crt
  key: /etc/nebula/lighthouse.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: true
  interval: 60

listen:
  host: 0.0.0.0
  port: 4242

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any
Laptop:
pki:
  # The CAs that are accepted by this node. Must contain one or more certificates created by 'nebula-cert ca'
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/laptop.crt
  key: /etc/nebula/laptop.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"

listen:
  host: 0.0.0.0
  port: 0

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any
Server:
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/server.crt
  key: /etc/nebula/server.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"

listen:
  host: 0.0.0.0
  port: 0

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any
With this setup, both server and laptop can ping the lighthouse, and the lighthouse can ping the server and the laptop, but the laptop cannot ping the server and the server cannot ping the laptop.
I get messages such as this as it's trying to make the connection:
INFO[0006] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0007] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0009] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0011] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0012] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0014] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0016] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
@nfam I get a similar error; not sure it's the problem:
DEBU[0066] Error while validating outbound packet: packet is not ipv4, type: 6 packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 139 176 20 9 146 65 14 250 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 60 66 0 0 0 0]"
@jatsrt the "Error while validating outbound packet" messages can mostly be ignored; they are just some packet types nebula doesn't support bouncing off.
As far as the handshakes, for some reason hole punching isn't working. A few things to try:
- Add punch_back: true on the "server" and "laptop" nodes (see the sketch below).
- Explicitly allow all UDP in to the "server" node from the internet (via AWS security groups, just as a test).
- Verify iptables isn't blocking anything.
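For reference, a minimal sketch of those two settings as top-level keys, which is where they lived in nebula 1.0 era configs (later releases may nest them under a punchy: block, so check the example config for your version):

punchy: true       # periodically punch outbound so the NAT mapping stays open
punch_back: true   # ask the remote side to punch back toward us during handshakes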
Also, it appears the logs with the handshake messages are from the laptop? If so, can you also share nebula logs from the server as it tries to reach the laptop?
Thanks!
Aha, @nfam I think I spotted the config problem.
Instead of

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "LIGHTHOUSE_PUBLIC_IP"
it should be

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"
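In other words, lighthouse.hosts takes the lighthouse's nebula IP, and static_host_map is what ties that nebula IP to a routable address. A sketch with placeholder addresses:

static_host_map:
  # nebula IP of the lighthouse -> its real internet address and port
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]

lighthouse:
  am_lighthouse: false
  hosts:
    - "192.168.100.1"   # the nebula IP again, not the public IP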
Adding #40 to cover the accidental misconfiguration noted above.
@rawdigits yes, it is. Now both laptops can ping each other. Thanks!
@rawdigits
- Added punch_back: true on "server" and "laptop".
- The security group for that node is currently wide open for all protocols.
- No iptables on any of these nodes; base Ubuntu server for testing.
Server log:
time="2019-11-24T00:25:21Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:22Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:22Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:23Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:24Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:25Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:26Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:27Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:28Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:30Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
So, I tried a few more setups, and it comes down to this: if the two hosts trying to communicate with each other are on different networks and both behind NAT, it will not work.
If the lighthouse does not facilitate the communication/tunneling, this would make sense, but is it meant to be a limitation?
The dual-NAT scenario is a bit tricky; there is possibly room for improvement from nebula's perspective there. Do you have details on the types of NAT you are dealing with?
@nbrownus nothing crazy. I've done multiple AWS VPC NAT gateways with hosts behind them, and they cannot connect. I've also tried "home" NAT (a Google WiFi router), with no success.
From a networking perspective I get why it's "tricky"; I was hoping there was some trick nebula was doing.
@rawdigits can speak to the punching better than I can. If you are having problems in AWS then we can get a test running and sort out the issues.
Yeah, so all my tests have had at least one host behind an AWS NAT Gateway.
Long shot, but one more thing to try until I set up an AWS NAT GW: set the UDP port on all nodes to 4242 and let NAT remap it. One ISP I've dealt with blocks random ephemeral UDP ports above 32,000, presumably because they think every high UDP port is BitTorrent.
Probably won't work, but it's easy to test.
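Concretely, that means pinning the listen port on the non-lighthouse nodes rather than leaving it ephemeral (a sketch):

listen:
  host: 0.0.0.0
  port: 4242   # pinned; 0 would pick a random ephemeral port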
@rawdigits same issue
Network combination:
- Lighthouse: DigitalOcean NYC3, public IP
- Server: AWS Oregon, private VPC with AWS NAT Gateway (172.31.0.0/16)
- Laptop: Verizon FiOS with Google WiFi router NAT (192.168.1.0/24)
- Server2 (added later to test): AWS Ohio, private VPC with AWS NAT Gateway (10.200.200.0/24)

I added the second server in a different VPC on AWS to remove the FiOS variable, and had the same results with server and server2 trying to communicate:
INFO[0065] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0066] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
INFO[0067] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0069] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
INFO[0071] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0072] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
@jatsrt I'll stand up a testbed this week to explore what may be the cause of the issue. Thanks!
I have got the same situation:
- node_A <----> lighthouse: OK
- node_B <----> lighthouse: OK
- node_A <----> node_B: does not work; they cannot ping each other.

But I found that node_A and node_B can communicate with each other ONLY if both are connected to the same router, such as the same WiFi router.
PS: punch_back: true is set on both node_A and node_B.
There is no firewall on node_A, node_B, or the lighthouse.
Hole punching is very difficult and random.
I also can't get nebula to work properly when both nodes are behind a typical NAT (technically PAT), regardless of any port pinning I do in the config. They happily connect to the lighthouse I have in AWS, but it seems like something isn't working properly. I've got punchy and punch_back enabled on everything and it doesn't seem to help. I've tried setting the port on the nodes to 0, and also tried the same port the lighthouse is listening on.
The nodes have no issues connecting to each other over the MPLS, but we don't want that (performance reasons).
Edit: To add a bit more detail, even Meraki's AutoVPN can't deal with this. In their setup the "hub" needs to be told its public IP and a fixed port that is open inbound. I'd be fine with that as an option, and it may be the only reliable one if both nodes are behind different NATs.
Another option I had considered: what if we could use the lighthouses to hairpin traffic? I'd much rather pay AWS for the bandwidth than have to deal with unfriendly NATs everywhere.
I did a bit more research, and it appears that the AWS NAT Gateway uses symmetric NAT, which isn't friendly to hole punching of any kind. NAT gateways also don't appear to support any type of port forwarding, so fixing this by statically assigning and forwarding a port doesn't appear to be an option.
A NAT instance would probably work, but I realize that's probably not a great option. One thing I recommend considering is giving instances a routable IP address but disallowing all inbound traffic. This wouldn't greatly change the security of your network, since you still aren't allowing any unsolicited packets to reach the hosts, but it would allow hole punching to work properly.
I don't think NAT as such is the issue, but rather PAT (port address translation). Unfortunately, with PAT you can't predict what your public port will be, and hole punching becomes impossible if both ends are behind a similar PAT. I'm going to do some testing, but I think that as long as one of the two nodes has a 1:1 NAT (no port translation), a public IP directly on the node isn't a concern.
If I get particularly ambitious, I may attempt to whip up some code in the lighthouse to detect when one or both nodes are behind a PAT and throw a warning saying that this won't work out of the box.
> If I get particularly ambitious I may attempt to whip up some code in lighthouse to detect when one/both nodes are behind a PAT and throw a warning saying that this won't work out of the box
I've thought about this before. You need at least 2 lighthouses, and I think it's best to implement it as a flag on the non-lighthouses (when you query the lighthouses for a host and get results with the same IP but different ports, you know the remote is problematic).
I haven't dug into the handshake code, but if you include the source port in the handshake, the lighthouse can compare it to what it sees on the wire. If they differ, you know something in the middle is doing port translation.
> Aha, @nfam I think I spotted the config problem.
> Instead of
>
> lighthouse:
>   am_lighthouse: false
>   interval: 60
>   hosts:
>     - "LIGHTHOUSE_PUBLIC_IP"
>
> it should be
>
> lighthouse:
>   am_lighthouse: false
>   interval: 60
>   hosts:
>     - "192.168.100.1"
I bet this is also my issue... will test it soon. That section is confusing 😕
That was not the fix; I had it configured like this already. After more testing, I think what I have is a hole-punching issue with my NAT.
- The lighthouse is a DigitalOcean droplet with a public IP and port 4242 open via UFW. This seems fine.
- My laptop is behind a regular consumer Netgear router with whatever NAT that has.
- Even with punchy and punch_back enabled I can't connect. I can see both the laptop and the lighthouse trying to handshake with each other endlessly. It seems like they are trying to punch back to each other and failing.
- If I open firewall port 4242 to my laptop's internal IP, things start to work fine. But this kind of defeats the purpose of trying to use this in the first place.
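For anyone settling for that port-forward workaround, the node side would presumably pin its listen port so the router's forward lines up (a sketch; 4242 is a placeholder for whatever UDP port your router actually forwards to the machine):

listen:
  host: 0.0.0.0
  port: 4242   # must match the UDP port the home router forwards to this node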