Intermittent DNS timeouts in a pod

Open bbc88ks opened this issue 5 years ago • 31 comments

We have a couple of jobs that run in a pod, and the very first thing each one does is download a file from GitHub. These jobs fail intermittently, about once every couple of days, with a DNS resolution timeout.

Docker log:

time="2019-08-21T21:15:03Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849/shim.sock" debug=false pid=14981

CNI log:

2019-08-21T21:15:03.381Z [INFO]	AssignPodIPv4Address: Assign IP 172.22.124.20 to pod (name uu-snowflake-updater-1566422100-xmtp9, namespace prod container 61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849)
2019-08-21T21:15:03.381Z [INFO]	Send AddNetworkReply: IPv4Addr 172.22.124.20, DeviceNumber: 0, err: <nil>
2019-08-21T21:15:03.382Z [INFO]	Received add network response for pod uu-snowflake-updater-1566422100-xmtp9 namespace prod container 61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849: 172.22.124.20, table 0, external-SNAT: false, vpcCIDR: [172.22.0.0/16]
2019-08-21T21:15:03.410Z [INFO]	Added toContainer rule for 172.22.124.20/32 hostname:kubecd-prod-nodes-worker @timestamp:August 21st 2019, 17:15:39.000

Container log:

August 21st 2019, 17:15:03.701	  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
August 21st 2019, 17:15:03.701	                                 Dload  Upload   Total   Spent    Left  Speed
August 21st 2019, 17:15:08.771	
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0curl: (6) Could not resolve host: raw.githubusercontent.com

There is less than a 300 ms delay between the CNI finishing the iptables and veth setup and curl making its request. Is there a chance of a race condition in this scenario? Since it happens rarely and intermittently, it doesn't seem to be a configuration issue.

bbc88ks avatar Aug 22 '19 20:08 bbc88ks

Might be related to https://github.com/aws/amazon-vpc-cni-k8s/issues/493 , but we don't use Calico.

bbc88ks avatar Aug 22 '19 20:08 bbc88ks

Could be related to https://github.com/coredns/coredns/pull/2769 if you're using CoreDNS. Upgrading to >1.5.1 should fix it, if that's the issue you're facing.

You can further verify that this is the cause if curl raw.githubusercontent.com fails intermittently but curl --ipv4 raw.githubusercontent.com never does.
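
For example, something like this run from inside an affected pod (a rough sketch; the loop count and target host are arbitrary):

for i in $(seq 1 50); do
  # curl exit code 6 means "could not resolve host"
  curl -sS -o /dev/null --max-time 5 https://raw.githubusercontent.com || echo "dual-stack lookup $i failed"
  curl -sS -o /dev/null --max-time 5 --ipv4 https://raw.githubusercontent.com || echo "ipv4-only lookup $i failed"
done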

deiwin avatar Oct 01 '19 14:10 deiwin

Seems like it might not be the best idea to upgrade to >1.5.1 when running EKS 1.14: https://github.com/aws/containers-roadmap/issues/489

In addition to those performance issues, the proxy plugin is deprecated in newer CoreDNS releases, so upgrading from 1.3.1 to 1.5.2 in an existing cluster with the same ConfigMap won't be successful.

jnaulty avatar Oct 02 '19 23:10 jnaulty

It still seems to be OK to upgrade. We changed proxy -> forward and added ready in the ConfigMap, and added a readinessProbe to the Deployment spec.

     .:53 {
         errors
         health
+        ready
         kubernetes cluster.local {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
         }
         prometheus :9153
-        proxy . /etc/resolv.conf
+        forward . /etc/resolv.conf
         cache 30
     }

&

-        image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/coredns:v1.2.6
+        image: coredns/coredns:1.6.3
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 5
@@ -548,6 +548,11 @@ spec:
         - containerPort: 9153
           name: metrics
           protocol: TCP
+        readinessProbe:
+          httpGet:
+            path: /ready
+            port: 8181
+            scheme: HTTP
         resources:
           limits:
             memory: 170Mi

Some related links:

  • https://coredns.io/plugins/ready/
  • https://github.com/coredns/deployment/blob/576c4b687a0130bb27d8f8a777875fe3dfc0aa93/kubernetes/coredns.yaml.sed#L142-L146

deiwin avatar Oct 03 '19 06:10 deiwin

@bbc88ks Hi Val, would you mind letting us know if upgrading to the forward CoreDNS plugin (in the configmap) resolved your issues? (pun intended)

Thanks, and sorry for the delay in getting back to you on this!

jaypipes avatar Dec 10 '19 15:12 jaypipes

@bbc88ks, Wondering if this is related to the kernel conntrack race condition, where parallel DNS requests race to insert a conntrack entry; whichever request wins the race gets the entry, and the other fails with an insert error.

Can you confirm whether the below solves your issue?

dnsConfig:
  options:
    - name: single-request-reopen
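
For reference, a minimal sketch of where that goes in a Pod spec (the names and image are placeholders; the option is only honored by glibc-based images):

apiVersion: v1
kind: Pod
metadata:
  name: dns-workaround-example              # placeholder name
spec:
  dnsConfig:
    options:
      # reopen the socket before the second (A/AAAA) query instead of reusing it,
      # sidestepping the conntrack insert race
      - name: single-request-reopen
  containers:
    - name: main                             # placeholder
      image: example.com/updater:latest      # placeholder, glibc-based image assumed
      command: ["curl", "-sSI", "https://raw.githubusercontent.com"]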

nithu0115 avatar Dec 10 '19 23:12 nithu0115

@jaypipes We've recently updated our clusters to 1.14, CoreDNS 1.6.5 and aws-vpc-cni 1.5.5. The issue still exists. Is there an iptables trace rule we can set to get more insight? I think kube-dns is 10.100.0.10. I am not able to reproduce it when I am just running an ad hoc pod, which is why it's hard to pin down.

@nithu0115 should single-request-reopen be set in the affected pod's config? Some of the pods are based on Alpine with musl libc, so I don't think that would work for those.
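
(One possible starting point, sketched roughly below: a TRACE rule in the raw table on an affected node, matched on the kube-dns IP above. TRACE output goes to the kernel log, needs the nf_log backend loaded, and is very verbose, so it's best enabled only briefly.)

# on the node; 10.100.0.10 is the kube-dns service IP mentioned above
modprobe nf_log_ipv4
sysctl net.netfilter.nf_log.2=nf_log_ipv4
iptables -t raw -A PREROUTING -p udp -d 10.100.0.10 --dport 53 -j TRACE
iptables -t raw -A OUTPUT -p udp -d 10.100.0.10 --dport 53 -j TRACE
# reproduce the failure, then inspect the trace and remove the rules
dmesg | grep 'TRACE:'
iptables -t raw -D PREROUTING -p udp -d 10.100.0.10 --dport 53 -j TRACE
iptables -t raw -D OUTPUT -p udp -d 10.100.0.10 --dport 53 -j TRACE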

bbc88ks avatar Dec 18 '19 19:12 bbc88ks

@bbc88ks As long as you use musl, there is always a risk of having DNS issues. See https://github.com/kubernetes/kubernetes/issues/56903

mogren avatar Dec 19 '19 01:12 mogren

After setting up the node-local DNS cache we no longer see the GitHub timeouts. But one other issue (we thought it was the same thing, but apparently it's not) still happens once every couple of days: Kube API timeouts on new pods.

2020/01/02 13:15:45 Delete https://10.100.0.1:443/apis/argoproj.io/v1alpha1/namespaces/prod/workflows/uu-snowflake-updater-1577970900: dial tcp 10.100.0.1:443: i/o timeout
2020/01/02 13:16:15 Failed to submit workflow: Post https://10.100.0.1:443/apis/argoproj.io/v1alpha1/namespaces/prod/workflows: dial tcp 10.100.0.1:443: i/o timeout

It's intermittent and so far it seems to be happening if the pod is allocated on the primary interface.

bbc88ks avatar Jan 02 '20 19:01 bbc88ks

@bbc88ks Thanks a lot for the update. I guess that means there are at least two different issues involved, so we still need to keep digging for the underlying cause here.

mogren avatar Jan 02 '20 22:01 mogren

The GitHub timeouts are still occurring, but now instead of a name resolution error it's just a plain timeout on port 443. Also, every time we recycle and deploy new nodes, we don't see the issue for a couple of days.

bbc88ks avatar Jan 06 '20 21:01 bbc88ks

@bbc88ks Is this still an issue with the latest EKS AMIs? Amazon Linux backported fixes for the conntrack kernel issues.

mogren avatar Apr 29 '20 17:04 mogren

@mogren we face the same issue here, and we followed exactly the same recommendation to use node-local-dns. After upgrading the EKS AMI we are still facing the same issue. I am just wondering if there is any workaround.

Eslamanwar avatar Jun 25 '20 19:06 Eslamanwar

Hi @Eslamanwar! We have recently found out that using TCP for DNS lookups can cause issues. Do you mind checking that options use-vc is not set in /etc/resolv.conf and that force_tcp is not set for CoreDNS? Also, it's best to make sure that DNS responses are smaller than 4096 bytes, so they fit in UDP packets.
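
A couple of quick ways to check, as a sketch (the ConfigMap names are the upstream defaults and may differ in your cluster):

# use-vc forces TCP for every lookup from glibc resolvers
grep use-vc /etc/resolv.conf

# look for force_tcp on the forward plugin in CoreDNS and node-local-dns
kubectl -n kube-system get configmap coredns -o yaml | grep -n force_tcp
kubectl -n kube-system get configmap node-local-dns -o yaml | grep -n force_tcp

# rough response-size check for a given name (look at the "MSG SIZE rcvd" line)
dig +noall +stats raw.githubusercontent.com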

mogren avatar Jun 25 '20 23:06 mogren

@mogren Thanks a lot. After removing force_tcp, no DNS requests fail.

Eslamanwar avatar Jun 26 '20 10:06 Eslamanwar

@mogren, if that's the case, then you might want to also update some documentation. Currently EKS docs refer to Kubernetes docs, which suggest configuring force_tcp for the node-local DNS cache.

deiwin avatar Jun 26 '20 10:06 deiwin

We were facing the same issue in our clusters. We were using node-local-dns and had the force_tcp flag in its configuration. We were getting lots of timeouts that way. After removing the flag, the timeouts went away.

gurumaia avatar Jul 10 '20 13:07 gurumaia

We were having the same issue (see https://github.com/kubernetes/dns/issues/387). I removed the force_tcp flag from the forward config in node-local-dns, but I definitely still see TCP requests and responses to the upstream AWS VPC resolver, and response times are not good: 4, 2, and 1 seconds for about 1% of requests. However, when I set prefer_udp I see only UDP requests and responses, and response times are all good. We use the k8s.gcr.io/k8s-dns-node-cache:1.15.12 image.
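
For reference, the change amounts to something like this in the node-local-dns Corefile (a sketch based on the upstream template; the bind address and upstream placeholder will match whatever your deployment renders):

.:53 {
    errors
    cache 30
    reload
    bind 169.254.20.10
    forward . __PILLAR__UPSTREAM__SERVERS__ {
        prefer_udp
    }
    prometheus :9253
}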

123BLiN avatar Jul 13 '20 10:07 123BLiN

Thanks for the suggestion @123BLiN! I'll test that out as well.

mogren avatar Jul 13 '20 16:07 mogren

I also have a response from the AWS Support team on my case:

We have identified that you were using TCP DNS connections to the VPC Resolver. We have identified the root cause as limitations in the current TCP DNS handling on EC2 Nitro Instances. The software which forwards DNS requests to our fleet for resolution is limited to 2 simultaneous TCP connections and blocks on TCP queries for each connection. Volume exceeding 2 simultaneous requests will result in increased latency. It is our recommendation that you prefer UDP DNS lookups to prevent an increase in latency. This should provide an optimal path for DNS requests that are less than 4096 bytes, and minimize the TCP latency to DNS names which exceed 4096 bytes. We plan to address the limitations on TCP DNS queries for Nitro instances, but do not have an ETA for the fix yet.

123BLiN avatar Jul 15 '20 08:07 123BLiN

🤯

We plan to address the limitations on TCP DNS queries for Nitro instances, but do not have an ETA for the fix yet.

This should be very clearly documented somewhere.

Vlaaaaaaad avatar Jul 15 '20 09:07 Vlaaaaaaad

This just bit us. Any update on a fix? At least documenting this would be helpful.

irlevesque avatar Mar 25 '21 23:03 irlevesque

Hello, any solution for this? We are having terrible issues in production because of it.

Shahard2 avatar May 23 '21 20:05 Shahard2

We're also experiencing this issue with kernel version 4.14.243-185.433.amzn2.x86_64.

syndbg avatar Oct 08 '21 11:10 syndbg

Had the same problem. We reached out to AWS; they confirmed it and said they have no ETA for the fix. Just thought I would update the thread with the info in case someone else has the same problem.

This is what they suggested:

Workaround:

1. Linux users should ensure 'options use-vc' does not appear in /etc/resolv.conf.
2. Users of CoreDNS should ensure that the 'force_tcp' option is not enabled.
3. Customers should make sure the size of DNS responses is less than 4096 bytes so that they fit in UDP packets.

rmenn avatar Nov 10 '21 12:11 rmenn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Apr 13 '22 00:04 github-actions[bot]

Issue closed due to inactivity.

github-actions[bot] avatar Apr 27 '22 00:04 github-actions[bot]

/reopen

jayanthvn avatar Apr 27 '22 00:04 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Jun 28 '22 00:06 github-actions[bot]

/notstale

jayanthvn avatar Jun 28 '22 04:06 jayanthvn