🐛 BUG: Nebula cannot obtain the correct DNS server address from the system

Open aa51513 opened this issue 2 years ago • 12 comments

What version of nebula are you using?

1.7.2

What operating system are you using?

Linux ( Arm64 )

Describe the Bug

When starting Nebula, an error is reported: ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4

It looks like Nebula can't get the correct DNS server address from the system, but when I run the dig command everything works normally:

~ $ dig www.google.com

; <<>> DiG 9.16.41 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2327
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.google.com.			IN	A

;; ANSWER SECTION:
www.google.com.		128	IN	A	104.244.46.52

;; Query time: 30 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jun 21 13:19:00 CST 2023
;; MSG SIZE  rcvd: 48

As a supplement, the following are the contents of the file /etc/resolv.conf on my server:

nameserver 8.8.8.8
nameserver 8.8.4.4

I am very confused about why Nebula uses [::1]:53 as the DNS server address, regardless of the system configuration.

Please consider adding an optional configuration item for the DNS server address to the configuration file.

Logs from affected hosts

ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4

Config files from affected hosts

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/nebula.crt
  key: /etc/nebula/nebula.key
static_host_map:
  "10.10.10.1": ["mynebula.server.com:45445"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "10.10.10.1"
listen:
  host: "::"
  port: 45445
punchy:
  punch: true
relay:
  am_relay: true
  use_relays: true
tun:
  disabled: true
  dev: nebula
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
  routes:
  unsafe_routes:
logging:
  level: warning
  format: text
firewall:
  outbound_action: drop
  inbound_action: drop
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any

aa51513 avatar Jun 21 '23 05:06 aa51513

Hi @aa51513 -

Allowing configuration of a DNS resolver in the config file sounds like a good idea to me. That being said, I'm unsure why the settings in /etc/resolv.conf would be ignored. From reading Go docs I see this:

The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.

On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.

By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.

However, we disable CGO for Nebula builds so I suspect that only the pure Go resolver is in use. If that's the case, and the comment above is correct, I am surprised to hear that your /etc/resolv.conf settings are not being respected. Are you using a .local or mDNS name?

johnmaguire avatar Jun 25 '23 03:06 johnmaguire

Hi @aa51513, are you able to provide the requested information? Thanks!

johnmaguire avatar Jul 10 '23 15:07 johnmaguire

ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4

I'm sorry I didn't reply sooner; I was dealing with some personal matters. When the above issue occurred, I was using a normal ".com" domain name, neither a .local nor an mDNS name. I was able to add CNAME, A, and AAAA records on the domain management page. I even accessed my domain name from my mobile phone over 4G and was able to open my webpage normally, which suggests the problem is not with the domain name.

aa51513 avatar Jul 18 '23 10:07 aa51513

I am also having a lot of DNS issues. On Linux, I sometimes get a long delay before the connection initiates. After around 30 seconds, I will get an error:

ERRO[13680] DNS resolution failed for static_map host     error="lookup url.xyz: i/o timeout" hostname=url.xyz network=ip4

Sometimes it will then just sit there disconnected, although more often than not there will be another message saying the DNS results changed for host list, and it will go and connect.

On other occasions (also Linux) it connects ok, but then intermittently and consistently there will be looping error messages:

ERRO[13290] DNS resolution failed for static_map host     error="lookup url: i/o timeout" hostname= url.xyz network=ip4
INFO[13290] DNS results changed for host list             newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[13320] DNS results changed for host list             newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[13680] DNS resolution failed for static_map host     error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4
INFO[13680] DNS results changed for host list             newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[13710] DNS results changed for host list             newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[14070] DNS resolution failed for static_map host     error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4
INFO[14070] DNS results changed for host list             newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[14100] DNS results changed for host list             newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[14460] DNS resolution failed for static_map host     error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4

On Mac I haven't been able to connect at all, but my Mac is such a mix of different interfaces and experiments it's been really difficult to debug. If I run Nebula in a Docker container on the same system though, it performs the same as above.

The IP and URL that I removed from above are all standard ipv4 (although there is an ipv6 option on there, the IP in Nebula logs is the ipv4 one) and a subdomain. Domain has been active for months so has propagated fully.

Being able to specify DNS servers would be a good step.

maggie44 avatar Nov 12 '23 19:11 maggie44

Still trying to explore this. I can replicate it by changing the DNS entries in resolv.conf on my Mac and see it when using a slow connection. Connecting to a VPN changes resolv.conf and also helps replicate this. After the DNS change on occasion it reports:

ERRO[0060] DNS resolution failed for static_map host     error="lookup 123.xyz: no such host" hostname=123.xyz network=ip4

Then eventually:

INFO[0090] DNS results changed for host list             newSet="map[123.23.23.23:10102:{}]" origSet="&map[]"

and then after another 30 seconds it connects.

I see there is a retry cadence:

https://github.com/slackhq/nebula/pull/879/files

I haven't delved into the criteria for "DNS results changed for host list", but it might help if the retry interval were shorter while there has not yet been a successful DNS lookup, and the 30s cadence were used only for subsequent lookups. It might also help if a connect were triggered directly after "DNS results changed for host list" when there is not yet a live connection. At the moment, it looks like Nebula is very slow at connecting to lighthouses, but I think it is merely the timing of the retries.

I also wonder if the timeout of 200ms is too low for slow connections. I haven't been able to see any improvement by increasing it, but I'm also not sure if there is much benefit to it being that low when slower connections may be using Nebula.

maggie44 avatar Nov 22 '23 20:11 maggie44

Nebula comes with built-in DNS server support via Lighthouse hosts.

Frederic-Zhou avatar Aug 28 '24 12:08 Frederic-Zhou

@maggie44 The error you are seeing is different from the error in the original ticket. Have you tried increasing static_map.lookup_timeout? This is the value associated with the "i/o timeout" message from a slow DNS server. If that doesn't work, let's move the "i/o timeout" issue to a separate ticket.

johnmaguire avatar Aug 29 '24 14:08 johnmaguire

@maggie44 FYI, I've posted a PR here that may improve time-to-recovery in the situation you described. I would like to improve this further in the future (mentioned in the PR): https://github.com/slackhq/nebula/pull/1260

johnmaguire avatar Oct 25 '24 17:10 johnmaguire

I have a peculiar case where it only fails on system startup. My systemd service has After=network-online.target, and I have set lookup_timeout to 10s. It never fails when I manually restart it after system startup. I can provide my systemd unit file and Nebula config if needed.

haras-unicorn avatar Nov 24 '24 07:11 haras-unicorn

I too am getting my logs flooded with these kinds of error messages on my arm64:

level=error msg="DNS resolution failed for static_map host" error="lookup ???.duckdns.org: i/o timeout" hostname=???.duckdns.org network=ip4

Very similar to OP. Tried setting:

static_map:
  cadence: 120s
  lookup_timeout: 500ms

This reduced the frequency of the errors, but did not resolve them.

erykjj avatar Feb 12 '25 01:02 erykjj

Apologies for this prior "bug" report. I believe the issue lies with duckdns.org not resolving properly. I believe some of their servers are having issues at this time.

erykjj avatar Feb 13 '25 14:02 erykjj

I can confirm these issues as well. As far as I can tell (we're working with a very large number of embedded devices using Nebula), it seems that any additional DNS entries in /etc/resolv.conf are completely ignored. If the first nameserver doesn't resolve the name, no further ones seem to be used! We found this out with a specific device build that didn't purge /etc/resolv.conf after reboot, so our local DNS server was still listed first, ahead of the ones added by the mobile connection. Nebula was then unable to connect to any lighthouse. This was eventually fixed by removing that first line, after which resolution succeeded. A temporary workaround would be to either use IPs in the static host map or add static hosts to /etc/hosts. But we would also prefer "normal resolver behaviour" (static entries win, and all resolv.conf entries are tried serially if the previous ones fail).

ScR4tCh avatar Oct 02 '25 10:10 ScR4tCh