
🐛 BUG: no connectivity between nebula nodes when using different CAs

Open timteka opened this issue 4 months ago • 3 comments

What version of nebula are you using? (nebula -version)

1.9.7

What operating system are you using?

Linux

Describe the Bug

We decided to split our staff into two teams, Team1 and Team2, and issued an additional CA for the new team (say ca-2025-team2.crt). We also issued another CA as a replacement for the old one and concatenated all of them into one ca.crt (ca-2024.crt + ca-2025.crt + ca-2025-team2.crt). And... we got connectivity problems between various nodes. The lighthouse successfully handshakes all of them, but hosts can't ping each other, or sometimes only a few ping replies succeed and then it gets stuck. When we re-sign the Team2 certs with ca-2025.crt, voila, all is good again. What can be the cause of our problem?

Worth mentioning: nodes of Team2 can ping each other, and some of them can even ping nodes of Team1. But almost none of them can ping the most distant nodes, in our office. The Team2 nodes are located in AWS Tokyo.

timteka avatar Dec 17 '25 11:12 timteka
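For context: Nebula's pki.ca file may contain several CA certificates simply concatenated as PEM blocks, and a node must trust every CA that signed a peer it needs to reach. A minimal sketch of the setup described above (file names assumed from the report, host paths hypothetical):

  pki:
    # ca.crt assumed to be ca-2024.crt + ca-2025.crt + ca-2025-team2.crt,
    # appended one after another as PEM blocks
    ca: /etc/nebula/ca.crt
    cert: /etc/nebula/host.crt  # hypothetical host cert/key paths
    key: /etc/nebula/host.key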

Hi @timteka!

Can you post the contents of the CAs (nebula-cert print -path your-ca.crt), some certificates signed with each of them, and the config files of the hosts that failed to communicate? (Please redact your private keys!)

One thing that can cause this behavior is if the addresses you assign to nodes don't all have a subnet in common. As a quick example:

  • 10.2.0.1/24 and 10.3.0.1/24 CANNOT communicate
  • 10.2.0.1/8 and 10.3.0.1/8 CAN communicate

Is there any chance you have address conflicts between the Nebula overlay and your various underlay networks?

Please forgive me if these questions sound "obvious", but you know much more about your network than I do, and it's always good to check the basics first. Feel free to ping me in #support on our Slack instance too, if you think discussing there would be quicker/easier for you.

JackDoan avatar Dec 17 '25 16:12 JackDoan

Dear Jack, everything was working properly until 1 or 2 days ago, and the problem is only with the AWS instances. Maybe the hole punching or something stopped functioning as before. Our last changes were those games with the CAs, but it seems that's not it. We even made firewall holes for incoming 4242; maybe that's also not correct. Do ordinary nodes need to have any incoming ports specifically opened, or only the lighthouses (sitting on public IPs) and the relays? And if we opened some incoming ports like 4242, could we have broken the "normal" way of things?

timteka avatar Dec 17 '25 19:12 timteka

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/l4.crt
  key: /etc/nebula/l4.key

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "10.145.0.1"
    - "10.145.0.2"

  local_allow_list:
    interfaces:
      'br0': true
    "10.10.1.16/32": true

relay:
  am_relay: false
  use_relays: false

static_host_map:
  "10.145.0.1": ["wan_ip1:42420"]
  "10.145.0.2": ["wan_ip2:42420"]

listen:
  host: 0.0.0.0
  port: 42420

punchy:
  punch: true
  respond: true

tun:
  dev: nebula0
  unsafe_routes:
    - route: 10.146.0.0/25  # ovpn admins via lighthousey
      via: 10.145.0.1
    - route: 10.146.1.0/25  # ovpn quants via lighthousey
      via: 10.145.0.1
    - route: 10.146.0.128/25  # ovpn admins via s2
      via: 10.145.0.14

logging:
  level: debug
  format: text

stats:
  type: prometheus
  listen: 0.0.0.0:42421
  path: /metrics
  namespace: prometheusns
  subsystem: nebula
  interval: 10s

firewall:
  outbound_action: drop
  inbound_action: drop

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    # Default permissive inbound for team1
    - port: any
      proto: any
      groups:
        - team1
    # Default permissive inbound for team2
    - port: any
      proto: any
      groups:
        - team2

timteka avatar Dec 17 '25 20:12 timteka

Seems like we had one-directional connectivity problems: successful handshakes with the lighthouse from both nodes, but no traffic between them afterwards. Relaying helped us.

timteka avatar Dec 20 '25 07:12 timteka
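For reference, a minimal sketch of the relay settings involved, assuming the lighthouse at overlay IP 10.145.0.1 (which has a public IP) doubles as the relay; the names and addresses are taken from the config above, not confirmed by the reporter:

  # on the relay host (publicly reachable by both sides):
  relay:
    am_relay: true

  # on the nodes that need relaying:
  relay:
    am_relay: false
    use_relays: true
    relays:
      - "10.145.0.1"  # Nebula overlay IP of the relay host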

Hi @timteka - are you still experiencing issues, or did relays solve your problem? Some hosts behind certain NATs are unable to communicate without relays. If one side of the handshake has a port forwarded / firewall opened, or is behind certain Nebula-friendly NATs, you should be able to avoid relays. If this isn't happening, please verify that AWS or iptables/ufw/nft aren't blocking traffic to that port.
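As a minimal sketch of the "one side has a port forwarded / firewall opened" case, assuming UDP 42420 is the port allowed through the AWS security group (per the config above):

  # on the reachable side, pin the listen port so the opened
  # security-group/firewall rule keeps matching:
  listen:
    host: 0.0.0.0
    port: 42420

  punchy:
    punch: true    # send outbound punches to keep NAT mappings alive
    respond: true  # punch back when an inbound handshake stalls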

If you are still having issues can you provide logs from the affected hosts during attempted handshakes?

johnmaguire avatar Dec 22 '25 17:12 johnmaguire

If you are still having issues can you provide logs from the affected hosts during attempted handshakes?

Dear John, we're seeing some strange stuff. E.g., in one cloud (say Alibaba/AWS), in the same VPC, same security groups and all, several instances operate properly via Nebula and a few do not.

We haven't switched all instances to relaying; we are doing it gradually. I'll try to gather some logs.

timteka avatar Dec 22 '25 17:12 timteka

@timteka I'm not sure if this is relevant to your problem, but there is a typo in your config:

  addvertise_addr:
    - "<your-ip-here>:42420"

should be:

  advertise_addrs:
    - "<your-ip-here>:42420"

https://nebula.defined.net/docs/config/lighthouse/#lighthouseadvertise_addrs

JackDoan avatar Dec 22 '25 18:12 JackDoan