docker-alpine

DNS Issue

Open MosheMoradSimgo opened this issue 7 years ago • 53 comments

Hi,

We are running alpine (3.4) in a docker container over a Kubernetes cluster (GCP).

We have been seeing anomalies where a thread is stuck for 2.5 seconds. After some investigation with strace, we saw that DNS resolution times out once in a while.

Here are some examples:

23:18:27 recvfrom(5, "\f\361\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\1\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\243\213\360\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000045>
23:18:27 recvfrom(5, 0x7ffdd0e1fb90, 512, 0, 0x7ffdd0e1f640, 0x7ffdd0e1f61c) = -1 EAGAIN (Resource temporarily unavailable) <0.000014>
23:18:27 clock_gettime(CLOCK_REALTIME, {1487114307, 714908396}) = 0 <0.000015>
23:18:27 poll([{fd=5, events=POLLIN}], 1, 2499) = 0 (Timeout) <2.502024>

09:04:27 recvfrom(5<UDP:[0.0.0.0:36148]>, "\354\211\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\1\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\244\30\220\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000041>
09:04:27 recvfrom(5<UDP:[0.0.0.0:36148]>, 0x7ffec3d9b0b0, 512, 0, 0x7ffec3d9ab60, 0x7ffec3d9ab3c) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
09:04:27 clock_gettime(CLOCK_REALTIME, {1487149467, 555317749}) = 0 <0.000008>
09:04:27 poll([{fd=5<UDP:[0.0.0.0:36148]>, events=POLLIN}], 1, 2498) = 0 (Timeout) <2.499671>


09:18:47 recvfrom(5<UDP:[0.0.0.0:47282]>, " B\201\200\0\1\0\1\0\0\0\0\2db\6devone\5*****\3net\0\0\1\0\1\300\f\0\1\0\1\0\0\0\200\0\4h\307\16N", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 53 <0.000011>
09:18:47 recvfrom(5<UDP:[0.0.0.0:47282]>, 0x7ffdd0e1fb90, 512, 0, 0x7ffdd0e1f640, 0x7ffdd0e1f61c) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
09:18:47 clock_gettime(CLOCK_REALTIME, {1487150327, 679292144}) = 0 <0.000005>
09:18:47 poll([{fd=5<UDP:[0.0.0.0:47282]>, events=POLLIN}], 1, 2497) = 0 (Timeout) <2.498797>

And a good example:

08:22:25 recvfrom(5<UDP:[0.0.0.0:59162]>, "\20j\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\34\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\244\n\200\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000014>
08:22:25 recvfrom(5<UDP:[0.0.0.0:59162]>, 0x7ffec3d9aeb0, 512, 0, 0x7ffec3d9ab60, 0x7ffec3d9ab3c) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
08:22:25 clock_gettime(CLOCK_REALTIME, {1487146945, 638264715}) = 0 <0.000010>
08:22:25 poll([{fd=5<UDP:[0.0.0.0:59162]>, events=POLLIN}], 1, 2498) = 1 ([{fd=5, revents=POLLIN}]) <0.000010>
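
For anyone reading the traces above: the first four bytes of each recvfrom payload are the DNS transaction ID and the flags word, and the low four bits of the flags hold the RCODE (RFC 1035, section 4.1.1). A minimal Python sketch of that decoding follows. The function name decode_header is made up for illustration, and the byte strings are the first four bytes of two payloads above, transcribed by hand from strace's octal escapes.

```python
import struct

# RCODE names from RFC 1035, section 4.1.1 (first six values only).
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
               3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}

def decode_header(payload: bytes) -> dict:
    """Decode the ID and flags word at the start of a DNS message.

    Hypothetical helper for illustration; it looks at 4 bytes only.
    """
    txid, flags = struct.unpack("!HH", payload[:4])
    rcode = flags & 0x000F
    return {
        "id": txid,
        "is_response": bool(flags & 0x8000),  # QR bit
        "rcode": rcode,
        "rcode_name": RCODE_NAMES.get(rcode, "OTHER"),
    }

# "\f\361\201\203" in strace notation: an answer for the
# *.svc.cluster.local name.
svc_answer = decode_header(b"\x0c\xf1\x81\x83")
# " B\201\200" in strace notation: the answer for the name without
# the cluster suffix.
plain_answer = decode_header(b"\x20\x42\x81\x80")

print(svc_answer["rcode_name"], plain_answer["rcode_name"])
```

Read this way, the answers for the svc.cluster.local name carry RCODE 3 and the answer for the shorter name carries RCODE 0. In every trace the answer itself arrives quickly; the stall is in the poll call that follows it.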

In the past we already had some issues with DNS resolution in an older version (3.3), which were resolved when we moved to 3.4 (or so we thought).

Is this a known issue? Does anybody have a solution / workaround / suggestion what to do?

Thanks a lot.

MosheMoradSimgo avatar Feb 19 '17 14:02 MosheMoradSimgo

I have the same issue. Alpine: 3.5, Docker: 1.13.1-cs2

/ # time ping -c 1 dev11
PING dev11 (10.1.100.11): 56 data bytes
64 bytes from 10.1.100.11: seq=0 ttl=63 time=0.211 ms

--- dev11 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.211/0.211/0.211 ms
real    0m 2.50s
user    0m 0.00s
sys     0m 0.00s

Sartner avatar Feb 25 '17 07:02 Sartner

Hi,

With the latest version (3.5), I am seeing the error below.

fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/community: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/main: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.3/main/x86_64/APKINDEX.tar.gz: DNS lookup error
ERROR: unsatisfiable constraints:
  bash (missing):
    required by: world[bash]
  ca-certificates (missing):
    required by: world[ca-certificates]
  curl (missing):
    required by: world[curl]

Can anyone please help me resolve this so I can move forward?

Thanks

rawat-he avatar Mar 17 '17 05:03 rawat-he

The latter two comments don't sound like the same issue. This seems like a Kubernetes-specific thing. Do you know whether it happens only to Alpine containers, or does it affect others as well? I've heard of intermittent DNS resolution issues in Kubernetes, but they were not specific to Alpine.

andyshinn avatar May 05 '17 17:05 andyshinn

We're seeing slow DNS resolution in alpine:3.4 (not in Kubernetes):

$ time docker run --rm alpine:3.4 nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve    

Name:      google.com        
Address 1: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 2: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 3: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 4: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net

real    0m2.996s             
user    0m0.010s             
sys     0m0.005s  

Versus Busybox:

$ time docker run --rm busybox nslookup google.com
Server:    10.108.88.10      
Address 1: 10.108.88.10      

Name:      google.com        
Address 1: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net
Address 2: 216.58.204.78 lhr25s13-in-f14.1e100.net         
Address 3: 216.58.204.78 lhr25s13-in-f14.1e100.net         
Address 4: 216.58.204.78 lhr25s13-in-f14.1e100.net

real    0m0.545s             
user    0m0.011s             
sys     0m0.007s

Not sure what the null error suggests, but it might be related!

Docker version 17.05.0-ce, build 89658be

c24w avatar Jun 02 '17 15:06 c24w

I have an issue with DNS resolution in Alpine. My /etc/resolv.conf lists several search suffixes (6 of them). During DNS resolution I see that my DNS server answers only the first 6 or 7 requests (this is its DNS DoS protection). But according to the strace output, Alpine makes 2 requests for each search suffix.

The Ubuntu Docker image doesn't have this problem: it makes only one request for each search suffix.

So is it possible to change this behaviour and make only 1 request to the DNS server for each search suffix? This is important because Kubernetes usually adds 3 search suffixes. So if we have more than one search suffix of our own, and a DNS server that limits requests from a single IP, then we will most likely run into DNS resolution problems.
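
The arithmetic behind this concern can be sketched in a few lines of Python. This is only an illustration: the function name is made up, it assumes every search suffix gets tried, and the limit of 6 answered requests is taken from the "6 or 7 requests" observation above rather than measured.

```python
def queries_sent(search_suffixes: int, queries_per_name: int) -> int:
    """Upper bound on queries if every search suffix is tried once."""
    return search_suffixes * queries_per_name

# Assumed rate limit, from the "6 or 7 requests" observation above.
SERVER_LIMIT = 6

alpine_like = queries_sent(6, 2)  # 2 requests per suffix, as seen in strace
ubuntu_like = queries_sent(6, 1)  # 1 request per suffix

print(alpine_like, alpine_like > SERVER_LIMIT)  # 12 True
print(ubuntu_like, ubuntu_like > SERVER_LIMIT)  # 6 False
```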

mpashka avatar Aug 03 '17 23:08 mpashka

Yes, the latest Alpine image has a problem with DNS resolution. All my app images built on Alpine have the same problem on Kubernetes v1.7.0.


[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup heapster.kube-system
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      heapster.kube-system
Address 1: 10.100.249.248 heapster.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup http-svc.kube-system
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      http-svc.kube-system
Address 1: 10.102.217.7 http-svc.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup ftpserver-service.demo
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'ftpserver-service.demo'

justlooks avatar Aug 11 '17 02:08 justlooks

During my investigation I found that the problem is with my DNS server. Some time ago Alpine didn't support the resolv.conf options 'search' and 'domain', but that is no longer the case. They also claim that resolution is done in parallel and thus results can differ, but that is not the case for me either. I found that Alpine makes 2 requests because one is for IPv4 (A record) and the other is for IPv6 (AAAA record).

My trouble is related to the DNS server itself. If there are several search domains in resolv.conf and the DNS server reports 'Server failure' (RCODE = 2) for some of those domains, then Alpine retries that name. If the DNS server reports 'No such name' (RCODE = 3), then Alpine continues with the next search domain. Ubuntu, on the other hand, doesn't treat 'Server failure' (RCODE = 2) as a hard failure and just continues with the other search domains.

You can check the RCODE that the DNS server returns for a specific name with the command

# dig @<dns_server> dns_name_to_check

and look at the 'status:' field: it can be NXDOMAIN (which is 'No such name', RCODE = 3) or SERVFAIL. BTW, nslookup operates in the same manner: it respects the RCODE and stops if the DNS server responds with 'Server failure' (RCODE = 2).
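
The difference described above can be modelled with a small Python sketch. It models only the behaviour as reported in this comment (give up at SERVFAIL versus move on to the next suffix); it is not musl's or glibc's actual code, and the function name, the stop_on_servfail flag, and the example domains and address are all hypothetical.

```python
NXDOMAIN, SERVFAIL = 3, 2

def resolve_with_search(name, search, answers, stop_on_servfail):
    """Walk the search list and return (candidate, address or None).

    `answers` maps a candidate name to an address (str) or to an
    RCODE (int); names missing from the map count as NXDOMAIN.
    """
    for suffix in search:
        candidate = f"{name}.{suffix}"
        result = answers.get(candidate, NXDOMAIN)
        if isinstance(result, str):
            return candidate, result
        if result == SERVFAIL and stop_on_servfail:
            return candidate, None  # give up on the whole lookup
        # NXDOMAIN, or a tolerated SERVFAIL: try the next suffix
    result = answers.get(name, NXDOMAIN)  # finally, the bare name
    return name, (result if isinstance(result, str) else None)

# Made-up data: the first suffix hits a failing server.
search = ["broken.example", "good.example"]
answers = {"db.broken.example": SERVFAIL, "db.good.example": "10.0.0.5"}

strict = resolve_with_search("db", search, answers, stop_on_servfail=True)
lenient = resolve_with_search("db", search, answers, stop_on_servfail=False)
print(strict)   # ('db.broken.example', None)
print(lenient)  # ('db.good.example', '10.0.0.5')
```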

mpashka avatar Aug 11 '17 13:08 mpashka

I tried this on alpine-docker 3.7, with /etc/resolv.conf as follows:

nameserver 10.254.0.100
search  localdomain  somebaddomain
options ndots:5

My DNS server "10.254.0.100" manages its own domain 'localdomain' and forwards queries for other domains to an external DNS server. When I query google.com, the Alpine DNS client would

  1. try google.com.localdomain, and get an "NXDomain" response
  2. try google.com.somebaddomain, and get a "Refused" response; after receiving a "Refused/SERVFAIL" response, the Alpine client keeps retrying "google.com.somebaddomain", resulting in the final failure.

I also tried the CentOS/Ubuntu Docker images. Their DNS clients give up on the "Refused/Servfail" response, move on to try "google.com", and get the expected response.

Is retrying the same name after receiving a "Refused/Servfail" response the secure/expected behaviour, or is it a bug in Alpine?
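
The order of names tried above follows from the search list and ndots:5. A rough Python sketch of that expansion is below. The function name is made up, and it covers only what this comment describes: a name with fewer dots than ndots gets each search domain appended first and is tried as-is last. For names with at least ndots dots the sketch returns just the name itself; resolvers differ on whether they consult the search list afterwards.

```python
def candidate_names(name, search, ndots):
    """List the names a resolver would try, in order (simplified)."""
    if name.endswith("."):
        return [name]  # trailing dot: already fully qualified
    if name.count(".") >= ndots:
        return [name]  # simplification, see the note above
    return [f"{name}.{domain}" for domain in search] + [name]

# Values from the resolv.conf above.
tried = candidate_names("google.com", ["localdomain", "somebaddomain"], 5)
print(tried)
# ['google.com.localdomain', 'google.com.somebaddomain', 'google.com']
```

With this resolv.conf, the plain google.com query is reached only if the lookup survives the answer for google.com.somebaddomain.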

zq-david-wang avatar Apr 10 '18 11:04 zq-david-wang

We probably have the same issue. Two different containers running in the same cluster in parallel:

  • the image with 3.5.2 works normally, AWS DNS resolves in 0.01s
  • the image with 3.7.0 has a big lag; DNS either resolves in 5 seconds or does not resolve at all.

KIVagant avatar May 11 '18 08:05 KIVagant

For the DNS delay, try adding the line options single-request to resolv.conf. See https://wiki.archlinux.org/index.php/Domain_name_resolution#Hostname_lookup_delayed_with_IPv6

zioalex avatar May 25 '18 15:05 zioalex

I don't think musl (which is used by Alpine) has the single-request resolver option.

joshbenner avatar May 29 '18 13:05 joshbenner

I tried the following change, and it seems to work. (Tested on my cluster and pushed to davidzqwang/alpine-dns:3.7.)

diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
index 209c20f..abb7da5 100644
--- a/src/network/lookup_name.c
+++ b/src/network/lookup_name.c
@@ -202,7 +202,7 @@ static int name_from_dns_search(struct address buf[static MAXADDRS], char canon[
                        memcpy(canon+l+1, p, z-p);
                        canon[z-p+1+l] = 0;
                        int cnt = name_from_dns(buf, canon, canon, family, &conf);
-                       if (cnt) return cnt;
+                       if (cnt > 0 || cnt == EAI_AGAIN) return cnt;
                }
        }

zq-david-wang avatar Jun 11 '18 16:06 zq-david-wang

I have tested 3.6, 3.7 and edge, and all are affected by https://bugs.busybox.net/show_bug.cgi?id=675. Alpine 3.7 and edge use BusyBox v1.27.2 (2017-12-12 10:41:50 GMT) multi-call binary, but if I pull busybox:1.27.2 and test nslookup, it doesn't have the error. So I am not sure whether just upgrading busybox will fix the issue. The busybox bug report hints that the libc in use influences the problem.

runephilosof avatar Jun 12 '18 10:06 runephilosof

fetch http://mirror.ps.kz/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/main: DNS lookup error
WARNING: Ignoring APKINDEX.1b054110.tar.gz: No such file or directory
fetch http://mirror.ps.kz/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/community: DNS lookup error
WARNING: Ignoring APKINDEX.ce38122e.tar.gz: No such file or directory

I am getting the error above. How can I fix it?

krikri90 avatar Jul 25 '18 14:07 krikri90

Hi,

We're running a couple of Docker containers on AWS EC2, with images based on Alpine 3.7. DNS resolution is very slow. Here is an example:

time nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve

Name:      google.com
Address 1: 216.58.207.174 muc11s04-in-f14.1e100.net
Address 2: 2a00:1450:4016:80a::200e muc11s12-in-x0e.1e100.net
real    0m 2.53s
user    0m 0.00s
sys     0m 0.00s

Another test, using curl:

time curl https://packagist.org/packages/list.json?vendor=composer  --output list.json
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                             Dload  Upload   Total   Spent    Left  Speed
100   174    0   174    0     0     58      0 --:--:--  0:00:03 --:--:--    48
real    0m 3.61s
user    0m 0.01s
sys     0m 0.00s

Interestingly, if we pass curl the -4 option, which makes it resolve the address to IPv4 only, the result is much faster, as it should be:

time curl -4 https://packagist.org/packages/list.json?vendor=composer  --output list.json
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                             Dload  Upload   Total   Spent    Left  Speed
100   174    0   174    0     0    174      0 --:--:-- --:--:-- --:--:--  1359
real    0m 0.13s
user    0m 0.01s
sys     0m 0.00s
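
The same comparison can be made without curl by timing getaddrinfo directly. In the Python sketch below, AF_INET plays the role of curl's -4, while AF_UNSPEC asks for both IPv4 and IPv6 results. The helper name time_lookup is made up, and the example resolves localhost only so that it runs anywhere; substitute the host you are testing.

```python
import socket
import time

def time_lookup(host, family):
    """Return (seconds, number_of_results) for one getaddrinfo call."""
    start = time.monotonic()
    try:
        results = socket.getaddrinfo(host, 443, family=family,
                                     type=socket.SOCK_STREAM)
    except socket.gaierror:
        results = []  # resolution failed; the elapsed time is still useful
    return time.monotonic() - start, len(results)

HOST = "localhost"  # placeholder; replace with the host under test

for label, family in (("IPv4 only", socket.AF_INET),
                      ("IPv4 + IPv6", socket.AF_UNSPEC)):
    elapsed, count = time_lookup(HOST, family)
    print(f"{label}: {elapsed:.3f}s, {count} result(s)")
```

If only the AF_UNSPEC lookup shows the multi-second delay, that points at the combined A and AAAA lookup rather than at the HTTP transfer.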

There's a workaround proposed here: https://github.com/gliderlabs/docker-alpine/issues/313#issuecomment-409872142

Is a release that fixes this expected soon? Thanks.

sadok-f avatar Aug 22 '18 16:08 sadok-f

FYI @brb has found some kernel race conditions which relate to this symptom. See https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts for technical details

bboreham avatar Aug 22 '18 19:08 bboreham

I found that if I install bind-tools, it all works:

RUN apk add bind-tools

zhouqiang-cl avatar Aug 26 '18 17:08 zhouqiang-cl

@zhouqiang-cl Unfortunately, RUN apk add bind-tools does not solve my name resolution problems. I am running a container with Alpine 3.8 in AWS Fargate and I am getting errors when resolving hostnames.

EDIT: I have also moved to Debian stretch slim, and my DNS problems seem to be solved.

sebastianfuss avatar Aug 31 '18 07:08 sebastianfuss

I have converted a few images to Debian Jessie/Stretch slim and my DNS issues went away. Kubernetes 1.9.7 using kops in AWS. This has been bothering us for a long while.

jurgenweber avatar Sep 02 '18 23:09 jurgenweber

I too am seeing issues with MUSL DNS failure on a bare-metal Kubernetes cluster. The hosts in the cluster are all Ubuntu 18.04 machines using systemd-resolved for local DNS. I can reproduce the issue @sadok-f is having. This is on a Kubernetes 1.11.3 cluster (set up using Kubeadm 1.11.3, with Weave CNI), CoreDNS 1.1.3, systemd 237 on the host. Swapping images out for Debian stretch slim fixes the issues.

based64god avatar Sep 13 '18 14:09 based64god

@zhouqiang-cl @sebastianfuss Installing bind-tools just seems to provide a statically built binary, which seems to fix only the nslookup command, not the underlying issue.

jstoja avatar Sep 19 '18 12:09 jstoja

ERROR: tzdata-2018d-r1: temporary error (try again later)

chenyongze avatar Sep 22 '18 15:09 chenyongze

Can confirm the issue running multiple alpine containers in a Kubernetes cluster. Busybox images are fine, only Alpine is affected.

mblaschke avatar Oct 08 '18 08:10 mblaschke

Is there any progress on this issue? In my test, a newer musl version solves this problem.

swift1911 avatar Oct 15 '18 03:10 swift1911

@swift1911 could you share with us the test you ran and the versions of Alpine and musl that you used? That would be of tremendous help in checking for a fix!

jstoja avatar Oct 16 '18 17:10 jstoja

Guys, how can we push this forward? It's an extremely big problem!

Mykolaichenko avatar Nov 08 '18 19:11 Mykolaichenko

Is there any way to reproduce this without using kubernetes?

Alternatively, does anyone have a tcpdump trace that shows exactly what is going on?

ncopa avatar Nov 08 '18 20:11 ncopa

@ncopa You can use the client and the server from https://github.com/brb/conntrack-race to reproduce the issue w/o k8s.

brb avatar Nov 15 '18 07:11 brb

I don't know if this will help anyone else, but we found that if we run any Alpine-based Docker image on top of Amazon's ECS AMI, we get a 400 ms timeout in DNS resolution, and we cannot find out where it's coming from.

Our resolv.conf looks like:

~ $ cat /etc/resolv.conf
options timeout:2 attempts:5
; generated by /sbin/dhclient-script
search ec2.internal
nameserver 172.16.0.2

If we use an ubuntu-based image we don't have this issue:

$ sudo iptables -I FORWARD -p udp --sport 53 -j DROP
$ sudo docker run -it bash
bash-4.4# ping tugboat.info
ping: bad address 'tugboat.info'
bash-4.4# ping tecnobrat.com
ping: bad address 'tecnobrat.com'
bash-4.4# exit
exit
[status stage bstolz@ip-172-17-50-25 ~]$ sudo iptables -D FORWARD -p udp --sport 53 -j DROP

image

You can see from the Wireshark capture that it sends a request every 400 ms instead of every 2 seconds as set in our resolv.conf.

I'm not sure what's causing it, but it's causing a lot of DNS timeouts for us.

tecnobrat avatar Nov 19 '18 17:11 tecnobrat

I just realized that we have options timeout:2 attempts:5, which means 2 s = 2000 ms, and 2000 / 5 = 400 ms.

Is alpine using an OVERALL timeout of 2 seconds, and then attempting to accomplish 5 attempts within that 2 seconds? Instead of 2 seconds per attempt?
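
The two readings of timeout can be put side by side in a short Python sketch. The function name and the overall flag are made up for illustration, and which reading Alpine actually implements is exactly the open question here.

```python
def retry_interval_ms(timeout_s, attempts, overall):
    """Milliseconds between retries under the two readings of `timeout`.

    overall=True:  `timeout` covers all attempts together (the hypothesis).
    overall=False: `timeout` applies to each attempt separately.
    """
    total_ms = timeout_s * 1000
    return total_ms / attempts if overall else float(total_ms)

# Values from the resolv.conf above: timeout:2 attempts:5
print(retry_interval_ms(2, 5, overall=True))   # 400.0
print(retry_interval_ms(2, 5, overall=False))  # 2000.0

# Defaults documented in resolv.conf(5): timeout:5 attempts:2
print(retry_interval_ms(5, 2, overall=True))   # 2500.0
```

The 400 ms figure matches the spacing seen in the capture. If the container in the first post was using the documented defaults, the same reading gives 2500 ms, which lines up with the roughly 2.5 second poll timeouts in its strace output. That is consistent with the hypothesis but does not prove it.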

tecnobrat avatar Nov 19 '18 17:11 tecnobrat