π BUG: Unable to reconnect after MacBook sleeps
What version of nebula are you using? (nebula -version)
1.9.4
What operating system are you using?
Mac
Describe the Bug
Nebula connects and works fine on Mac. I am running from within a Docker container, using bridge networking. After I close my laptop, go away, and come back it occasionally fails to reconnect (after the Mac sleeps/hibernates/whatever Mac does), showing Handshake timed out. From inside the container, I can reach all the UDP sockets of the lighthouse, if I run the same container again with the same config it starts and connects fine, but for some reason the original container that was left running is unable to connect.
It seems calling a config reload fixes the issue without restarting the container, although as itβs a hard issue to replicate I have only been able to confirm that once:
func (c *C) ReloadConfig() {
Is there something that the ReloadConfig() is doing that isn't done during a timeout? Perhaps NAT tables shift in Docker when it wakes back up and the UDP connection needs re-establishing? Any thoughts to help debug this one appreciated.
Logs from affected hosts
time="2024-12-07T20:52:27Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3249700910 localIndex=3249700910 remoteIndex=0 udpAddrs="[...:...]" vpnIp=100.64.0.1
time="2024-12-07T20:52:27Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2722712268 localIndex=2722712268 remoteIndex=0 udpAddrs="[...:...]" vpnIp=100.64.0.2
time="2024-12-07T20:52:27Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=451846405 localIndex=451846405 remoteIndex=0 udpAddrs="[...:...]" vpnIp=100.64.0.3
time="2024-12-07T20:52:33Z" level=info msg="Handshake timed out" durationNs=6803663170 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3249700910 localIndex=3249700910 remoteIndex=0 udpAddrs="[...:...]" vpnIp=100.64.0.1
time="2024-12-07T20:52:33Z" level=info msg="Handshake timed out" durationNs=6803892545 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2722712268 localIndex=2722712268 remoteIndex=0 udpAddrs="[...:...]" vpnIp=100.64.0.2
time="2024-12-07T20:52:33Z" level=info msg="Handshake timed out" durationNs=6803942128 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=451846405 localIndex=451846405 remoteIndex=0 udpAddrs="[...:...]" vpnIp=100.64.0.3
Config files from affected hosts
Potentially related to rebinding that I think only ever landed for ios? https://github.com/slackhq/nebula/pull/394
Hi @maggie44 - if you're running on Docker Desktop for macOS, Docker is running inside a Linux VM. I'm wondering if something is going wonky with this networking stack. I wonder if you can try pulling any relevant logs from the Docker VM? #394 definitely may be relevant.
@maggie44 were you able to grab any relevant logs from the Docker VM by chance? I'm a little unsure how to proceed here.
I haven't used it on Mac for a while. Will start it up again and try and reproduce.
Here are some logs I managed to grab from reproducing the issue today. This is using the current build from main branch (i.e. certs v2). They are from dockers dns.log and coincide with the outage, which is the one that showed a pattern and potentially useful.
There are no gaps in the logs, i.e. these are as they appear in the logs exactly. The gap between 9.10 and 10.01 is presumably the sleep.
[2025-04-06T09:08:22.509641970Z][init.leases] [;my-lighthouse3.com. IN A] = [123.456.789.2]
[2025-04-06T09:08:52.605112678Z][init.leases] Add 123.456.789.2 for my-lighthouse3.com.
[2025-04-06T09:08:52.605174970Z][init.leases] Add 123.456.789.3 for my-lighthouse1.com.
[2025-04-06T09:08:52.605279095Z][init.leases] [;my-lighthouse1.com. IN A] = [123.456.789.3]
[2025-04-06T09:08:52.605244386Z][init.leases] [;my-lighthouse3.com. IN A] = [123.456.789.2]
[2025-04-06T09:08:52.605302303Z][init.leases] Add 123.456.789.1 for my-lighthouse2.com.
[2025-04-06T09:08:52.605327095Z][init.leases] [;my-lighthouse2.com. IN A] = [123.456.789.1]
[2025-04-06T09:09:23.356901345Z][init.leases] Add 123.456.789.2 for my-lighthouse3.com.
[2025-04-06T09:09:23.356953303Z][init.leases] Add 123.456.789.3 for my-lighthouse1.com.
[2025-04-06T09:09:23.356972678Z][init.leases] [;my-lighthouse3.com. IN A] = [123.456.789.2]
[2025-04-06T09:09:23.357005511Z][init.leases] [;my-lighthouse1.com. IN A] = [123.456.789.3]
[2025-04-06T09:09:23.357024178Z][init.leases] Add 123.456.789.1 for my-lighthouse2.com.
[2025-04-06T09:09:23.357043970Z][init.leases] [;my-lighthouse2.com. IN A] = [123.456.789.1]
[2025-04-06T09:09:54.548798136Z][init.leases] [;my-lighthouse1.com. IN A] = []
[2025-04-06T09:09:54.548890386Z][init.leases] [;my-lighthouse2.com. IN A] = []
[2025-04-06T09:09:54.548927136Z][init.leases] [;my-lighthouse3.com. IN A] = []
[2025-04-06T09:10:24.576386595Z][init.leases] [;my-lighthouse1.com. IN A] = []
[2025-04-06T09:10:24.576664720Z][init.leases] [;my-lighthouse3.com. IN A] = []
[2025-04-06T09:10:24.577070220Z][init.leases] [;my-lighthouse2.com. IN A] = []
[2025-04-06T10:01:26.798386761Z][init.leases] [;my-lighthouse3.com. IN A] = []
[2025-04-06T10:01:26.798461220Z][init.leases] [;my-lighthouse2.com. IN A] = []
[2025-04-06T10:01:26.798489678Z][init.leases] [;my-lighthouse1.com. IN A] = []
[2025-04-06T10:20:41.577708261Z][init.leases] Add 123.456.789.1 for my-lighthouse2.com.
[2025-04-06T10:20:41.578477220Z][init.leases] [;my-lighthouse2.com. IN A] = [123.456.789.1]
[2025-04-06T10:20:41.578591636Z][init.leases] Add 123.456.789.2 for my-lighthouse3.com.
[2025-04-06T10:20:41.578820261Z][init.leases] [;my-lighthouse3.com. IN A] = [123.456.789.2]
[2025-04-06T10:20:41.578933636Z][init.leases] Add 123.456.789.3 for my-lighthouse1.com.
[2025-04-06T10:20:41.579033428Z][init.leases] [;my-lighthouse1.com. IN A] = [123.456.789.3]
[2025-04-06T10:21:11.269917345Z][init.leases] Add 123.456.789.3 for my-lighthouse1.com.
[2025-04-06T10:21:11.270419428Z][init.leases] [;my-lighthouse1.com. IN A] = [123.456.789.3]
[2025-04-06T10:21:11.270501636Z][init.leases] Add 123.456.789.2 for my-lighthouse3.com.
[2025-04-06T10:21:11.270530011Z][init.leases] [;my-lighthouse3.com. IN A] = [123.456.789.2]
time="2025-04-06T08:51:39Z" level=info msg="Handshake message received" certName=cert-1 durationNs=70478792 fingerprint=<REDACTED> handshake="map[stage:2 style:ix_psk0]" initiatorIndex=3424991473 issuer=<REDACTED> remoteIndex=3424991473 responderIndex=62464583 sentCachedPackets=0 udpAddr="<REDACTED_IP:PORT>" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T08:51:55Z" level=info msg="Tunnel status" certName=cert-3 localIndex=863097762 remoteIndex=1968826778 tunnelCheck="map[method:active state:dead]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T08:51:55Z" level=info msg="Tunnel status" certName=cert-1 localIndex=1404133448 remoteIndex=1366269626 tunnelCheck="map[method:active state:dead]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T09:09:37Z" level=info msg="Tunnel status" certName=cert-2 localIndex=3306653380 remoteIndex=2653868903 tunnelCheck="map[method:active state:dead]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T09:09:37Z" level=info msg="Tunnel status" certName=cert-3 localIndex=1724271082 remoteIndex=2164768455 tunnelCheck="map[method:active state:dead]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T09:09:37Z" level=info msg="Tunnel status" certName=cert-1 localIndex=3424991473 remoteIndex=62464583 tunnelCheck="map[method:active state:dead]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T09:09:54Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T09:09:54Z" level=info msg="DNS results changed for host list" newSet="map[]" origSet="&map[<REDACTED_IP:PORT>:{}]"
time="2025-04-06T09:09:54Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T09:09:54Z" level=info msg="DNS results changed for host list" newSet="map[]" origSet="&map[<REDACTED_IP:PORT>:{}]"
time="2025-04-06T09:09:54Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T09:09:54Z" level=info msg="DNS results changed for host list" newSet="map[]" origSet="&map[<REDACTED_IP:PORT>:{}]"
time="2025-04-06T09:10:24Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3749061216 localIndex=3749061216 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T09:10:24Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1792710042 localIndex=1792710042 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T09:10:24Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1124387700 localIndex=1124387700 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T09:10:24Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T09:10:24Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T09:10:24Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T10:01:03Z" level=info msg="Handshake timed out" durationNs=6885060003 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3749061216 localIndex=3749061216 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:01:03Z" level=info msg="Handshake timed out" durationNs=6884504169 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1792710042 localIndex=1792710042 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:01:03Z" level=info msg="Handshake timed out" durationNs=6884520587 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1124387700 localIndex=1124387700 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:01:26Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T10:01:26Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T10:01:26Z" level=error msg="DNS resolution failed for static_map host" error="lookup <REDACTED_DOMAIN> on <REDACTED_DNS>: no such host" hostname=<REDACTED_DOMAIN> network=ip4
time="2025-04-06T10:20:41Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2847314796 localIndex=2847314796 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:20:41Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2324299670 localIndex=2324299670 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:20:41Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2143674360 localIndex=2143674360 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:20:41Z" level=info msg="DNS results changed for host list" newSet="map[<REDACTED_IP:PORT>:{}]" origSet="&map[]"
time="2025-04-06T10:20:41Z" level=info msg="DNS results changed for host list" newSet="map[<REDACTED_IP:PORT>:{}]" origSet="&map[]"
time="2025-04-06T10:20:41Z" level=info msg="DNS results changed for host list" newSet="map[<REDACTED_IP:PORT>:{}]" origSet="&map[]"
time="2025-04-06T10:20:48Z" level=info msg="Handshake timed out" durationNs=6809037044 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2847314796 localIndex=2847314796 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:20:48Z" level=info msg="Handshake timed out" durationNs=6809265461 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2324299670 localIndex=2324299670 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:20:48Z" level=info msg="Handshake timed out" durationNs=6809309753 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2143674360 localIndex=2143674360 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:21:58Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=748526291 localIndex=748526291 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:21:58Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1918039502 localIndex=1918039502 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:21:58Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2600733905 localIndex=2600733905 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:22:05Z" level=info msg="Handshake timed out" durationNs=6706686961 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=748526291 localIndex=748526291 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:22:05Z" level=info msg="Handshake timed out" durationNs=6707060252 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1918039502 localIndex=1918039502 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
time="2025-04-06T10:22:05Z" level=info msg="Handshake timed out" durationNs=6707242961 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2600733905 localIndex=2600733905 remoteIndex=0 udpAddrs="[<REDACTED_IP:PORT>]" vpnAddrs="[<REDACTED_VPN_ADDR>]"
Next time this occurs, can you run kill -SIGQUIT <pid> inside the Docker container, where the pid is the Nebula process? This should dump a stack trace, which should appear in the Docker logs. I'd like to check on the running goroutines.
Still occurring @maggie44 ?