🐛 BUG: Nebula cannot obtain the correct DNS server address from the system
What version of nebula are you using?
1.7.2
What operating system are you using?
Linux ( Arm64 )
Describe the Bug
When starting nebula, an error is reported:
ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4
It looks like Nebula can't get the correct DNS server address from the system, but when I run the dig command everything works normally:
~ $ dig www.google.com
; <<>> DiG 9.16.41 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2327
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;www.google.com. IN A
;; ANSWER SECTION:
www.google.com. 128 IN A 104.244.46.52
;; Query time: 30 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jun 21 13:19:00 CST 2023
;; MSG SIZE rcvd: 48
As a supplement, the following are the contents of the file /etc/resolv.conf on my server:
nameserver 8.8.8.8
nameserver 8.8.4.4
I am very confused about why Nebula uses [::1]:53 as the DNS server address, regardless of the system configuration.
Please consider adding an optional DNS server address setting to the configuration file.
Logs from affected hosts
ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4
Config files from affected hosts
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/nebula.crt
  key: /etc/nebula/nebula.key
static_host_map:
  "10.10.10.1": ["mynebula.server.com:45445"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "10.10.10.1"
listen:
  host: "::"
  port: 45445
punchy:
  punch: true
relay:
  am_relay: true
  use_relays: true
tun:
  disabled: true
  dev: nebula
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
  routes:
  unsafe_routes:
logging:
  level: warning
  format: text
firewall:
  outbound_action: drop
  inbound_action: drop
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
Hi @aa51513 -
Allowing configuration of a DNS resolver in the config file sounds like a good idea to me. That being said, I'm unsure why the settings in /etc/resolv.conf would be ignored. From reading the Go docs, I see this:
The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.
On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.
By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.
However, we disable CGO for Nebula builds so I suspect that only the pure Go resolver is in use. If that's the case, and the comment above is correct, I am surprised to hear that your /etc/resolv.conf settings are not being respected. Are you using a .local or mDNS name?
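One detail that may be relevant here: when the pure Go resolver cannot open /etc/resolv.conf, or finds no nameserver entries in it, it falls back to 127.0.0.1:53 and [::1]:53. That would match the [::1]:53 address in the error, although whether that is what happened on the reporter's host is only an assumption. The sketch below imitates that fallback with a simplified, hypothetical parser (`nameservers` is not a function from Go or Nebula):

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// fallbackServers mirrors the defaults the pure Go resolver uses when it
// finds no usable nameserver configuration.
var fallbackServers = []string{"127.0.0.1:53", "[::1]:53"}

// nameservers extracts "nameserver" entries from resolv.conf-style text and
// returns them as host:port strings. Hypothetical helper for illustration;
// the real parser handles more directives (search, options, ...).
func nameservers(resolvConf string) []string {
	var servers []string
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "nameserver" && net.ParseIP(fields[1]) != nil {
			servers = append(servers, net.JoinHostPort(fields[1], "53"))
		}
	}
	if len(servers) == 0 {
		// Nothing usable was found: fall back to localhost.
		return fallbackServers
	}
	return servers
}

func main() {
	// Contents as posted in the report.
	fmt.Println(nameservers("nameserver 8.8.8.8\nnameserver 8.8.4.4\n"))
	// An empty or unreadable file, from the process's point of view.
	fmt.Println(nameservers(""))
}
```

If the Nebula process sees a different filesystem view than the interactive shell (for example a chroot, a container, or a sandboxed environment), dig and Nebula could end up with different resolver settings.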
Hi @aa51513, are you able to provide the requested information? Thanks!
ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4
I'm sorry I didn't reply sooner; I was tied up with some personal matters. When the above issue occurred, I was using a normal ".com" domain name, neither a .local nor an mDNS name. I was able to add CNAME, A, and AAAA records on the domain management page. I also accessed the domain from my mobile phone over 4G and my webpage opened normally, which suggests the problem is not with the domain name itself.
I am also having a lot of DNS issues. On Linux, I sometimes get a long delay before the connection initiates. After around 30 seconds, I will get an error:
ERRO[13680] DNS resolution failed for static_map host error="lookup url.xyz: i/o timeout" hostname=url.xyz network=ip4
Sometimes it will then just sit there disconnected, although more often than not there will be another message saying the DNS results changed for host list, and it will go and connect.
On other occasions (also Linux) it connects ok, but then intermittently and consistently there will be looping error messages:
ERRO[13290] DNS resolution failed for static_map host error="lookup url: i/o timeout" hostname= url.xyz network=ip4
INFO[13290] DNS results changed for host list newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[13320] DNS results changed for host list newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[13680] DNS resolution failed for static_map host error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4
INFO[13680] DNS results changed for host list newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[13710] DNS results changed for host list newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[14070] DNS resolution failed for static_map host error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4
INFO[14070] DNS results changed for host list newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[14100] DNS results changed for host list newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[14460] DNS resolution failed for static_map host error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4
On Mac I haven't been able to connect at all, but my Mac is such a mix of different interfaces and experiments it's been really difficult to debug. If I run Nebula in a Docker container on the same system though, it performs the same as above.
The IP and URL that I removed from above are all standard ipv4 (although there is an ipv6 option on there, the IP in Nebula logs is the ipv4 one) and a subdomain. Domain has been active for months so has propagated fully.
Being able to specify DNS servers would be a good step.
Still trying to explore this. I can replicate it by changing the DNS entries in resolv.conf on my Mac and see it when using a slow connection. Connecting to a VPN changes resolv.conf and also helps replicate this. After the DNS change on occasion it reports:
ERRO[0060] DNS resolution failed for static_map host error="lookup 123.xyz: no such host" hostname=123.xyz network=ip4
Then eventually:
INFO[0090] DNS results changed for host list newSet="map[123.23.23.23:10102:{}]" origSet="&map[]"
and then after another 30 seconds it connects.
I see there is a retry cadence:
https://github.com/slackhq/nebula/pull/879/files
I haven't delved into the criteria for "DNS results changed for host list", but it might help if the retry interval were shorter while there has not yet been a successful DNS lookup, and then switched to the 30s cadence for subsequent lookups. It would also help if a connect were triggered directly after "DNS results changed for host list" when there is not yet a live connection. At the moment, it looks like Nebula is very slow at connecting to lighthouses, but I think it is merely the timing of the retries.
I also wonder if the timeout of 200ms is too low for slow connections. I haven't been able to see any improvement by increasing it, but I'm also not sure if there is much benefit to it being that low when slower connections may be using Nebula.
Nebula comes with built-in DNS server support via Lighthouse hosts.
@maggie44 The error you are seeing is different from the error in the original ticket. Have you tried increasing static_map.lookup_timeout? This is the value associated with the "i/o timeout" message from a slow DNS server. If that doesn't work, let's move the "i/o timeout" issue to a separate ticket.
@maggie44 FYI, I've posted a PR here that may improve time-to-recovery in the situation you described. I would like to improve this further in the future (mentioned in the PR): https://github.com/slackhq/nebula/pull/1260
I have a peculiar case where it fails only on system startup. My systemd service has After=network-online.target, and I have set lookup_timeout to 10s. It never fails when I manually restart it after system startup. I can provide my systemd unit file and Nebula config if needed.
I too am getting my logs flooded with these kinds of error messages on my arm64:
level=error msg="DNS resolution failed for static_map host" error="lookup ???.duckdns.org: i/o timeout" hostname=???.duckdns.org network=ip4
Very similar to OP. Tried setting:
static_map:
  cadence: 120s
  lookup_timeout: 500ms
This slowed down the frequency of the errors, but did not resolve it.
Apologies for this prior "bug" report. I believe the issue lies with duckdns.org not resolving properly. I believe some of their servers are having issues at this time.
I can confirm these issues as well. As far as I can tell (we're working with a very large number of embedded devices using Nebula), any additional DNS entries in /etc/resolv.conf seem to be completely ignored. If the first nameserver doesn't resolve, no further ones seem to be used! We found this out with a specific device build that didn't purge /etc/resolv.conf after reboot, so our local DNS server was still listed first, ahead of the entries added by a mobile connection. Nebula was then unable to connect to any lighthouse. Removing that first line fixed it, and resolution then succeeded. A temporary workaround is to either use IPs in the static host map or add static hosts to /etc/hosts. But we would also prefer "normal resolver behaviour" (static entries win, and resolv.conf entries are all tried serially if previous ones fail).