amazon-vpc-cni-k8s
Intermittent DNS timeouts in a pod
We have a couple of jobs that run in a pod, and the very first thing each job does is download a file from GitHub. These jobs fail intermittently, roughly once every couple of days, with a DNS resolution timeout.
Docker log:
time="2019-08-21T21:15:03Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849/shim.sock" debug=false pid=14981
CNI log:
2019-08-21T21:15:03.381Z [INFO] AssignPodIPv4Address: Assign IP 172.22.124.20 to pod (name uu-snowflake-updater-1566422100-xmtp9, namespace prod container 61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849)
2019-08-21T21:15:03.381Z [INFO] Send AddNetworkReply: IPv4Addr 172.22.124.20, DeviceNumber: 0, err: <nil>
2019-08-21T21:15:03.382Z [INFO] Received add network response for pod uu-snowflake-updater-1566422100-xmtp9 namespace prod container 61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849: 172.22.124.20, table 0, external-SNAT: false, vpcCIDR: [172.22.0.0/16]
2019-08-21T21:15:03.410Z [INFO] Added toContainer rule for 172.22.124.20/32
Container log:
August 21st 2019, 17:15:03.701 % Total % Received % Xferd Average Speed Time Time Time Current
August 21st 2019, 17:15:03.701 Dload Upload Total Spent Left Speed
August 21st 2019, 17:15:08.771
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0curl: (6) Could not resolve host: raw.githubusercontent.com
There is less than a 300 ms delay between the CNI finishing the iptables and veth setup and curl making its request. Is there a chance of a race condition in this scenario? Since it happens rarely and intermittently, it doesn't seem to be a configuration issue.
Might be related to https://github.com/aws/amazon-vpc-cni-k8s/issues/493, but we don't use Calico.
Could be related to https://github.com/coredns/coredns/pull/2769 if you're using CoreDNS. Upgrading to >1.5.1 should fix the issue if that's the issue you're facing.
You can further verify that that's the issue if curl raw.githubusercontent.com fails intermittently, but curl --ipv4 raw.githubusercontent.com never does.
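If it helps, a throwaway pod along these lines can run that comparison in-cluster. This is only a sketch; the pod name, image tag, and loop count are placeholders, not anything specific to this setup:

apiVersion: v1
kind: Pod
metadata:
  name: dns-repro                      # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl:7.72.0    # any image with curl and a shell works
      command:
        - sh
        - -c
        - |
          # Compare default (A+AAAA) lookups against IPv4-only lookups.
          i=0
          while [ "$i" -lt 100 ]; do
            curl -sS -o /dev/null https://raw.githubusercontent.com || echo "default lookup failed on attempt $i"
            curl -sS -o /dev/null -4 https://raw.githubusercontent.com || echo "ipv4-only lookup failed on attempt $i"
            i=$((i + 1))
          done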
Seems like it might not be the best idea to upgrade to >1.5.1 when running EKS 1.14: https://github.com/aws/containers-roadmap/issues/489
In addition to those performance issues, the proxy plugin is deprecated in later releases of CoreDNS, so upgrading from 1.3.1 to 1.5.2 in existing clusters while keeping the same ConfigMap won't work.
It still seems to be OK to upgrade. We changed proxy -> forward in the ConfigMap, added ready to the ConfigMap, and added a readinessProbe to the Deployment spec.
  .:53 {
      errors
      health
+     ready
      kubernetes cluster.local {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
      }
      prometheus :9153
-     proxy . /etc/resolv.conf
+     forward . /etc/resolv.conf
      cache 30
  }
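For completeness, the Corefile after applying that change, wrapped back into the ConfigMap; this assumes the default coredns ConfigMap name and Corefile key in kube-system:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local {
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
    }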
and in the Deployment spec:
-       image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/coredns:v1.2.6
+       image: coredns/coredns:1.6.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
@@ -548,6 +548,11 @@ spec:
        - containerPort: 9153
          name: metrics
          protocol: TCP
+       readinessProbe:
+         httpGet:
+           path: /ready
+           port: 8181
+           scheme: HTTP
        resources:
          limits:
            memory: 170Mi
Some related links:
- https://coredns.io/plugins/ready/
- https://github.com/coredns/deployment/blob/576c4b687a0130bb27d8f8a777875fe3dfc0aa93/kubernetes/coredns.yaml.sed#L142-L146
@bbc88ks Hi Val, would you mind letting us know if upgrading to the forward CoreDNS plugin (in the configmap) resolved your issues? (pun intended)
Thanks, and sorry for the delay in getting back to you on this!
@bbc88ks, wondering if this is related to the kernel race condition issue, where CoreDNS would send parallel requests, and whichever request wins the race gets an entry in the conntrack table while the other gets an insert failure.
Can you confirm if the below solves your issue?
dnsConfig:
  options:
    - name: single-request-reopen
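For reference, that snippet goes under spec.dnsConfig in the pod (or pod template) spec. A minimal sketch, with a placeholder name, image and command:

apiVersion: v1
kind: Pod
metadata:
  name: dns-options-example            # placeholder name
spec:
  dnsConfig:
    options:
      - name: single-request-reopen    # glibc resolver option; ignored by musl-based images
  containers:
    - name: main
      image: curlimages/curl:7.72.0    # placeholder image
      command: ["curl", "-sS", "https://raw.githubusercontent.com"]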
@jaypipes We've recently updated our clusters to 1.14, CoreDNS 1.6.5 and aws-vpc-cni 1.5.5. The issue still exists. Is there an iptables trace rule we can set to get more insight? I think kube-dns is 10.100.0.10. I am not able to reproduce it when I am just running an ad hoc pod, which is why it's hard to pin down.
@nithu0115 should single-request-reopen be set in the affected pod config? Some of them are based on Alpine with musl libc, so I don't think that would work.
@bbc88ks As long as you use musl, there is always a risk of having DNS issues. See https://github.com/kubernetes/kubernetes/issues/56903
After setting up node-local DNS cache we no longer see the GitHub timeouts. But one other issue (we thought it was the same thing, but apparently it's not) still happens once every couple of days - Kube API timeouts on new pods.
2020/01/02 13:15:45 Delete https://10.100.0.1:443/apis/argoproj.io/v1alpha1/namespaces/prod/workflows/uu-snowflake-updater-1577970900: dial tcp 10.100.0.1:443: i/o timeout
2020/01/02 13:16:15 Failed to submit workflow: Post https://10.100.0.1:443/apis/argoproj.io/v1alpha1/namespaces/prod/workflows: dial tcp 10.100.0.1:443: i/o timeout
It's intermittent, and so far it seems to happen when the pod is allocated on the primary interface.
@bbc88ks Thanks a lot for the update. I guess that means there are at least two different issues involved, so we still need to keep digging for the underlying cause here.
The GitHub timeouts are still occurring, but now instead of a name resolution error it's just a plain 443 timeout. Also, every time we recycle and deploy new nodes we don't see the issue for a couple of days.
@bbc88ks Is this still an issue with the latest EKS AMIs? Amazon Linux backported fixes for the conntrack kernel issues.
@mogren we face the same issue here, and we followed exactly the same recommendation to use node-local-dns. After upgrading the EKS AMI we are still facing the same issue. I am just wondering if there is any workaround.
Hi @Eslamanwar! We have recently found out that using TCP for DNS lookups can cause issues. Do you mind checking that options use-vc is not set in /etc/resolv.conf and that force_tcp is not set for CoreDNS? Also, it's best to make sure that the size of the DNS responses is less than 4096 bytes, to ensure they fit in UDP packets.
@mogren Thanks a lot, after removing force_tcp no DNS requests fail.
@mogren, if that's the case, then you might want to also update some documentation. Currently the EKS docs refer to the Kubernetes docs, which suggest configuring force_tcp for the node-local DNS cache.
We were facing the same issue in our clusters. We were using node-local-dns and in its configuration we had the force_tcp flag. We were getting lots of timeouts that way. After removing the flag, the timeouts went away.
We were having the same issue here https://github.com/kubernetes/dns/issues/387
I've removed the force_tcp flag from the forwarder config in node-local-dns, but I definitely still see TCP requests and responses to the upstream AWS VPC resolver, and response times are not good - 4, 2 and 1 seconds for about 1% of requests. However, when I set prefer_udp I see only UDP requests and responses, and response times are all good.
We use the k8s.gcr.io/k8s-dns-node-cache:1.15.12 image.
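In case it helps anyone else, the upstream block of the node-local-dns Corefile ends up looking roughly like this with prefer_udp. Only a sketch: the bind address and the upstream placeholder follow the stock node-local-dns manifest, and the cluster.local / reverse-zone server blocks are omitted.

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  Corefile: |
    # cluster.local, in-addr.arpa and ip6.arpa server blocks omitted
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10
        prometheus :9253
        forward . __PILLAR__UPSTREAM__SERVERS__ {
            prefer_udp
        }
    }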
Thanks for the suggestion @123BLiN! I'll test that out as well.
I also have a response from AWS Support team on my case:
We have identified that you were using TCP DNS connections to the VPC Resolver. We have identified the root cause as limitations in the current TCP DNS handling on EC2 Nitro Instances. The software which forwards DNS requests to our fleet for resolution is limited to 2 simultaneous TCP connections and blocks on TCP queries for each connection. Volume exceeding 2 simultaneous requests will result in increased latency. It is our recommendation that you prefer UDP DNS lookups to prevent an increase in latency. This should provide an optimal path for DNS requests that are less than 4096 bytes, and minimize the TCP latency to DNS names which exceed 4096 bytes. We plan to address the limitations on TCP DNS queries for Nitro instances, but do not have an ETA for the fix yet.
🤯
We plan to address the limitations on TCP DNS queries for Nitro instances, but do not have an ETA for the fix yet.
This should be very clearly documented somewhere.
This just bit us. Any update on a fix? At least documenting this would be helpful.
Hello, is there any solution for this? We have terrible issues in production because of it.
We're also experiencing this issue with kernel version 4.14.243-185.433.amzn2.x86_64.
Had the same problem, reached out to AWS; they confirmed it and said they have no ETA for the fix. Just thought I would update the thread with the info in case someone else has the problem.
This is what they suggested:
Workaround:
1. Linux users should ensure 'options use-vc' does not appear in /etc/resolv.conf (a quick way to check from inside a pod is sketched after this list).
2. Users of CoreDNS should ensure that the option 'force_tcp' is not enabled.
3. Customers should make sure the size of the DNS responses is less than 4096 bytes to ensure they fit in UDP packets.
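For item 1, one way to see what a pod actually gets is a throwaway pod like the following; the name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: resolv-check             # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: busybox:1.32        # any image with a shell works
      command: ["sh", "-c", "cat /etc/resolv.conf"]   # look for 'options use-vc' in the output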
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Issue closed due to inactivity.
/reopen
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
/notstale