DNS resolution fails if default search domain has a wildcard match
Name resolution from inside the pod seems to be broken because of a combination of factors.
Version
# oc version
oc v3.7.0-rc.0+e92d5c5
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://127.0.0.1:8443
openshift v3.7.0-rc.0+e92d5c5
kubernetes v1.7.6+a08f5eeb62
Steps To Reproduce
It looks like the /etc/resolv.conf file generated by OpenShift does not work in every scenario.
First, to show that resolution works with a plain resolv.conf:
# cat /etc/resolv.conf
nameserver 8.8.8.8
search patrikdufresne.com
# nslookup -debug dl-cdn.alpinelinux.org
Server: 8.8.8.8
Address: 8.8.8.8#53
------------
QUESTIONS:
dl-cdn.alpinelinux.org, type = A, class = IN
ANSWERS:
-> dl-cdn.alpinelinux.org
canonical name = global.prod.fastly.net.
ttl = 59
-> global.prod.fastly.net
internet address = 151.101.0.249
ttl = 19
-> global.prod.fastly.net
internet address = 151.101.64.249
ttl = 19
-> global.prod.fastly.net
internet address = 151.101.128.249
ttl = 19
-> global.prod.fastly.net
internet address = 151.101.192.249
ttl = 19
AUTHORITY RECORDS:
ADDITIONAL RECORDS:
------------
Non-authoritative answer:
dl-cdn.alpinelinux.org canonical name = global.prod.fastly.net.
Name: global.prod.fastly.net
Address: 151.101.0.249
Name: global.prod.fastly.net
Address: 151.101.64.249
Name: global.prod.fastly.net
Address: 151.101.128.249
Name: global.prod.fastly.net
Address: 151.101.192.249
This is the /etc/resolv.conf generated in the pod. Resolution is not working:
# cat /etc/resolv.conf
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local patrikdufresne.com
options ndots:5
# nslookup -debug dl-cdn.alpinelinux.org
Server: 8.8.8.8
Address: 8.8.8.8#53
------------
QUESTIONS:
dl-cdn.alpinelinux.org.default.svc.cluster.local, type = A, class = IN
ANSWERS:
AUTHORITY RECORDS:
-> .
origin = a.root-servers.net
mail addr = nstld.verisign-grs.com
serial = 2017111401
refresh = 1800
retry = 900
expire = 604800
minimum = 86400
ttl = 86385
ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.default.svc.cluster.local: NXDOMAIN
Server: 8.8.8.8
Address: 8.8.8.8#53
------------
QUESTIONS:
dl-cdn.alpinelinux.org.svc.cluster.local, type = A, class = IN
ANSWERS:
AUTHORITY RECORDS:
-> .
origin = a.root-servers.net
mail addr = nstld.verisign-grs.com
serial = 2017111401
refresh = 1800
retry = 900
expire = 604800
minimum = 86400
ttl = 86394
ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.svc.cluster.local: NXDOMAIN
Server: 8.8.8.8
Address: 8.8.8.8#53
------------
QUESTIONS:
dl-cdn.alpinelinux.org.cluster.local, type = A, class = IN
ANSWERS:
AUTHORITY RECORDS:
-> .
origin = a.root-servers.net
mail addr = nstld.verisign-grs.com
serial = 2017111401
refresh = 1800
retry = 900
expire = 604800
minimum = 86400
ttl = 86378
ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.cluster.local: NXDOMAIN
Server: 8.8.8.8
Address: 8.8.8.8#53
------------
QUESTIONS:
dl-cdn.alpinelinux.org.patrikdufresne.com, type = A, class = IN
ANSWERS:
AUTHORITY RECORDS:
-> patrikdufresne.com
origin = ns2.no-ip.com
mail addr = hostmaster.no-ip.com
serial = 2010091255
refresh = 10800
retry = 1800
expire = 604800
minimum = 1800
ttl = 1799
ADDITIONAL RECORDS:
------------
Non-authoritative answer:
*** Can't find dl-cdn.alpinelinux.org: No answer
If I remove my domain name patrikdufresne.com from the search line, it works:
# cat /etc/resolv.conf
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
root@tymara:/home/ikus060# nslookup dl-cdn.alpinelinux.org
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
dl-cdn.alpinelinux.org canonical name = global.prod.fastly.net.
Name: global.prod.fastly.net
Address: 151.101.0.249
Name: global.prod.fastly.net
Address: 151.101.64.249
Name: global.prod.fastly.net
Address: 151.101.128.249
Name: global.prod.fastly.net
Address: 151.101.192.249
It also works if I remove options ndots:5 (with the resolver's default ndots:1, a name that already contains a dot is tried as an absolute name first):
# cat /etc/resolv.conf
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local patrikdufresne.com
root@tymara:/home/ikus060# nslookup dl-cdn.alpinelinux.org
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
dl-cdn.alpinelinux.org canonical name = global.prod.fastly.net.
Name: global.prod.fastly.net
Address: 151.101.0.249
Name: global.prod.fastly.net
Address: 151.101.64.249
Name: global.prod.fastly.net
Address: 151.101.128.249
Name: global.prod.fastly.net
Address: 151.101.192.249
I ran into this exact same issue with a fresh installation of OCP 3.7 on a RHEL 7.4 VM.
Outbound networking worked from the VM itself, and it also worked when I ran a container out of band from Kubernetes (using docker run). When OCP ran the container, outbound networking broke, but it could be fixed by removing either options ndots:5 or "search josborne.com". I couldn't figure out where "search josborne.com" was even coming from, because I didn't set it anywhere in the Ansible advanced installation. I changed my /etc/hostname file from openshift.josborne.com to openshift and rebooted; at that point "search josborne.com" was removed from the pod /etc/resolv.conf and everything started working. Is this user error or a bug? I've installed every release of OCP from scratch using an FQDN in my /etc/hostname file, and it first broke in either 3.6 or 3.7, so I think something has changed in the platform.
Right, so the problem is that if the domain listed in the search line does wildcard matching, then because of ndots:5 basically all hostnames end up being treated as subdomains of that default domain. E.g., *.josborne.com appears to resolve to a particular AWS hostname, so if you look up, say, github.com, it matches as github.com.josborne.com, which resolves to the AWS IP.
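A quick way to see the effect from a node or pod is to query the wildcard directly. These commands are only an illustration of the behaviour described above (the probe label is made up; dig is assumed to be installed):

# A wildcard record answers even for names that should not exist:
dig +short this-name-should-not-exist-12345.josborne.com

# With ndots:5, "github.com" has fewer than five dots, so the resolver tries
# the search list first; if the wildcard answers for the expanded name, the
# real github.com is never queried:
dig +short github.com.josborne.com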
I guess the search field in the pod resolv.conf is set automatically from the node hostname?
What we really want is to make service name lookups behave like ndots:5, but make other lookups not do that. We can't make the libc resolver do that, but in cases where we're running a DNS server inside the cluster, we could do the ndots-like special-casing inside that server, and then we could give the pods a resolv.conf without ndots.
The other possibility would be to stop including the node's domain in the pod resolv.conf's search field, but that would break any existing pods that were depending on the current behavior, so we'd need some sort of compatibility option.
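For what it's worth, on clusters new enough to have the pod-level dnsConfig field (not available on the OpenShift 3.7 / Kubernetes 1.7 in the original report), a single pod can keep ClusterFirst DNS but override the generated ndots:5. A minimal sketch, with a placeholder pod name and image:

# Override ndots for one pod: external names containing a dot are then tried
# as absolute names before the search list is applied, while bare service
# names still resolve through the cluster search domains.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-ndots-example
spec:
  containers:
  - name: shell
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "3600"]
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
    - name: ndots
      value: "1"
EOF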
Since the supported way to install OpenShift is the Ansible playbook, I would add extra validation in Ansible to make sure the provided DNS domain behaves as expected. If not, the playbook should fail and warn the user.
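A rough sketch of what such a validation could test, written as plain shell rather than a real playbook task (the domain variable and probe label are placeholders):

# Fail the install early if the configured DNS domain wildcard-matches
# arbitrary names, since combined with ndots:5 that breaks external lookups
# from pods.
domain="example.com"                       # domain supplied to the installer
probe="openshift-preflight-$(date +%s).${domain}"
if [ -n "$(dig +short "${probe}")" ]; then
  echo "ERROR: ${domain} answers for non-existent names (wildcard DNS);" >&2
  echo "pod DNS with ndots:5 would misresolve external hostnames." >&2
  exit 1
fi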
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
This is still an issue. /remove-lifecycle rotten
For Minishift this is an issue with some hypervisors that force a search entry via the DHCP offer. E.g., Hyper-V on the "default switch" uses search mshome.net and can cause lookups to github.com during S2I builds to fail.
Note: options ndots:5 has been part of Kubernetes since about 2015 => https://github.com/kubernetes/kubernetes/pull/10266/commits/23caf446ae69236641da0fdc432d4cfb5fff098d#diff-0db82891d463ba14dd59da9c77f4776eR66 (ref: https://github.com/kubernetes/kubernetes/pull/10266)
Same issue with an Ansible install of OpenShift 3.10.
Same for me: ndots:5 makes the resolver append the domain names from the search line before trying the original address.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale /lifecycle frozen
Hello, is there a workaround for this? I seem to be facing the same issue with k8s 1.19 and CoreDNS, where my external domain, which is part of the DNS search path, has a wildcard match.