🐛 BUG: no connectivity between nebula nodes when using different CAs
What version of nebula are you using? (nebula -version)
1.9.7
What operating system are you using?
Linux
Describe the Bug
We decided to split our staff into two teams, Team1 and Team2, and issued an additional CA for the new team (say ca-2025-team2.crt). We also issued another CA as a replacement for the old one and concatenated all of them into a single ca.crt (ca-2024.crt + ca-2025.crt + ca-2025-team2.crt). After that we got connectivity problems between various nodes. The lighthouse successfully handshakes all of them, but hosts can't ping each other, or sometimes only a few ping replies succeed and then the connection gets stuck. When we re-signed the Team2 certs with ca-2025.crt, everything worked again. What could be the cause of our problem?
Worth mentioning: Team2 nodes can ping each other, and some of them can even ping Team1 nodes. But almost none of them can ping the most distant nodes, in our office. The Team2 nodes are located in AWS Tokyo.
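For context, a Nebula CA bundle is just the CA certificates concatenated into one PEM file. A minimal sketch of rebuilding and sanity-checking such a bundle (the filenames match the ones in the report above; the PEM bodies here are placeholders, not real certificates):

```shell
# Placeholder PEM blocks standing in for the real CA certificates
for name in ca-2024 ca-2025 ca-2025-team2; do
  printf -- '-----BEGIN NEBULA CERTIFICATE-----\nplaceholder\n-----END NEBULA CERTIFICATE-----\n' > "$name.crt"
done

# Concatenate into a single trust bundle, as described above
cat ca-2024.crt ca-2025.crt ca-2025-team2.crt > ca.crt

# Sanity check: the bundle should contain one block per CA
grep -c 'BEGIN NEBULA CERTIFICATE' ca.crt   # prints 3
```

Every host that should accept certificates from all three CAs needs this combined file deployed as its `pki.ca`.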
Hi @timteka!
Can you post the contents of the CAs (nebula-cert print -path your-ca.crt), some certificates signed with each of them, and the config files of the hosts that failed to communicate? (Please redact your private keys!)
One thing that can cause this behavior is if the addresses you assign to nodes don't all have a subnet in common. As a quick example:
- 10.2.0.1/24 and 10.3.0.1/24 CANNOT communicate
- 10.2.0.1/8 and 10.3.0.1/8 CAN communicate
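To make that check concrete, here is a small bash sketch (plain integer math, not part of Nebula) that tests whether two IPv4 addresses fall in the same network at a given prefix length:

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer
ip_to_int() { local IFS=.; set -- $1; echo $(( ($1<<24) | ($2<<16) | ($3<<8) | $4 )); }

# same_net ADDR1 ADDR2 PREFIXLEN -> "yes" if both addresses share the network
same_net() {
  local mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  if [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]; then
    echo yes
  else
    echo no
  fi
}

same_net 10.2.0.1 10.3.0.1 24   # prints "no"  - cannot communicate
same_net 10.2.0.1 10.3.0.1 8    # prints "yes" - can communicate
```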
Is there any chance you have address conflicts between the Nebula overlay and your various underlay networks?
Please forgive me if these questions sound "obvious", but you know much more about your network than I do, and it's always good to check the basics first. Feel free to ping me in #support in our Slack instance too, if you think discussing there would be quicker/easier for you.
Dear Jack, everything was working properly until 1 or 2 days ago, and the problem is only with AWS instances. Maybe hole punching or something similar stopped functioning as before. Our last changes were those games with CAs, but it seems that's not it. We even opened firewall holes for incoming 4242. Maybe that's also not correct. Do ordinary nodes need any incoming ports specifically opened, or only the lighthouses (sitting on public IPs) and relays? And by opening some incoming ports like 4242, could we have broken the "normal" way of things?
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/l4.crt
  key: /etc/nebula/l4.key
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "10.145.0.1"
    - "10.145.0.2"
  local_allow_list:
    interfaces:
      'br0': true
    "10.10.1.16/32": true
relay:
  am_relay: false
  use_relays: false
static_host_map:
  "10.145.0.1": ["wan_ip1:42420"]
  "10.145.0.2": ["wan_ip2:42420"]
listen:
  host: 0.0.0.0
  port: 42420
punchy:
  punch: true
  respond: true
tun:
  dev: nebula0
  unsafe_routes:
    - route: 10.146.0.0/25 # ovpn admins via lighthousey
      via: 10.145.0.1
    - route: 10.146.1.0/25 # ovpn quants via lighthousey
      via: 10.145.0.1
    - route: 10.146.0.128/25 # ovpn admins via s2
      via: 10.145.0.14
logging:
  level: debug
  format: text
stats:
  type: prometheus
  listen: 0.0.0.0:42421
  path: /metrics
  namespace: prometheusns
  subsystem: nebula
  interval: 10s
firewall:
  outbound_action: drop
  inbound_action: drop
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    # Default permissive inbound for team1
    - port: any
      proto: any
      groups:
        - team1
    # Default permissive inbound for team2
    - port: any
      proto: any
      groups:
        - team2
It seems we had one-direction connectivity problems: successful handshakes with the lighthouse from both nodes, but no traffic between them afterwards. Relaying helped us.
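For anyone landing here with the same symptom, a minimal sketch of what that relay setup looks like in the config. These are two separate config files; the overlay IP 10.145.0.1 is just reused from the config above as a stand-in for whatever publicly reachable node you pick as the relay:

```yaml
# Config file 1: on a node with a publicly reachable IP, acting as the relay
relay:
  am_relay: true
```

```yaml
# Config file 2: on the hard-to-reach hosts, point at the relay's overlay IP
relay:
  am_relay: false
  use_relays: true
  relays:
    - 10.145.0.1   # placeholder: overlay IP of your relay node
```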
Hi @timteka - are you still experiencing issues, or did relaying solve your problem? Some hosts behind certain NATs are unable to communicate without relays. If one side of the handshake has a port forwarded / firewall opened, or is behind a Nebula-friendly NAT, you should be able to avoid relays. If that isn't happening, please verify that AWS or iptables/ufw/nft aren't blocking traffic to that port.
If you are still having issues can you provide logs from the affected hosts during attempted handshakes?
Dear John, we're seeing some strange behavior. For example, in one cloud (say Alibaba/AWS), in the same VPC with the same security groups and all, several instances operate properly via Nebula while a few do not.
We haven't switched all instances to relaying yet; we're doing it gradually. I'll try to gather some logs.
@timteka I'm not sure if this is relevant to your problem, but there is a typo in your config:
addvertise_addr:
  - "<your-ip-here>:42420"
should be:
advertise_addrs:
  - "<your-ip-here>:42420"
https://nebula.defined.net/docs/config/lighthouse/#lighthouseadvertise_addrs