DNS not working after reboot
What happened: I created a new kind cluster, then rebooted my computer. After the reboot, DNS cannot resolve addresses.
What you expected to happen:
DNS can resolve addresses
How to reproduce it (as minimally and precisely as possible):
- create a new kind cluster
- test dns: it's working
- reboot your machine (don't stop docker before reboot)
- test dns again:
#APISERVER=https://kubernetes.default.svc
#SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
#NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
#TOKEN=$(cat ${SERVICEACCOUNT}/token)
#CACERT=${SERVICEACCOUNT}/ca.crt
#curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api
curl: (6) Could not resolve host: kubernetes.default.svc
Taken from https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/#without-using-a-proxy
Anything else we need to know?:
- dns pods are running
- dns logs:
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.0
linux/amd64, go1.15.3, 054c9ae
dns lookup:
#nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1
#nslookup kubernetes.default.svc
;; connection timed out; no servers could be reached
resolv.conf:
#cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local fritz.box
nameserver 10.96.
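For completeness, a couple of commands that can cross-check this from outside the pod (a sketch, assuming the default CoreDNS setup that kind installs in kube-system):
$ kubectl -n kube-system get svc kube-dns                 # the ClusterIP here should match the pod's nameserver
$ kubectl -n kube-system get configmap coredns -o yaml    # shows the Corefile, including the upstream forward target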
Environment:
- kind version (use kind version): kind v0.11.1 go1.16.4 linux/amd64
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
- Docker version (use docker info):
  Client:
    Context: default
    Debug Mode: false
  Server:
    Containers: 5 (Running: 2, Paused: 0, Stopped: 3)
    Images: 11
    Server Version: 20.10.6-ce
    Storage Driver: btrfs (Build Version: Btrfs v4.15, Library Version: 102)
    Logging Driver: json-file
    Cgroup Driver: cgroupfs (Cgroup Version: 1)
    Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
    Swarm: inactive
    Runtimes: oci runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
    runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
    init version:
    Security Options: apparmor seccomp (Profile: default)
    Kernel Version: 5.3.18-59.16-default
    Operating System: openSUSE Leap 15.3
    OSType: linux
    Architecture: x86_64
    CPUs: 8
    Total Memory: 7.552GiB
    Name: Proxima-Centauri
    ID: M6J5:OLHQ:FXVM:M7WG:2OUA:SKGW:UCF5:DWJZ:4M7T:YA2W:6FBT:DOLG
    Docker Root Dir: /var/lib/docker
    Debug Mode: false
    Registry: https://index.docker.io/v1/
    Labels:
    Experimental: false
    Insecure Registries: 127.0.0.0/8
    Live Restore Enabled: false
  WARNING: No swap limit support
- OS (e.g. from /etc/os-release): NAME="openSUSE Leap" VERSION="15.3" ID="opensuse-leap" ID_LIKE="suse opensuse" VERSION_ID="15.3" PRETTY_NAME="openSUSE Leap 15.3" ANSI_COLOR="0;32" CPE_NAME="cpe:/o:opensuse:leap:15.3" BUG_REPORT_URL="https://bugs.opensuse.org" HOME_URL="https://www.opensuse.org/"
I assume this snippet was a copy-paste error; it is missing the last part of the IP address:
#cat /etc/resolv.conf search default.svc.cluster.local svc.cluster.local cluster.local fritz.box nameserver 10.96.
Are you using one node or multiple nodes in the cluster? Clusters with multiple nodes don't handle reboots.
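(For anyone unsure, the node count can be checked with something like the following; the cluster name is an assumption, kind's default is "kind":)
$ kind get nodes --name kind
$ kubectl get nodes -o wide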
Hi, I'm also running into this issue! Although I'm not sure it's necessarily caused by a restart in my case.
$ kubectl run -it --rm --restart=Never busybox1 --image=busybox sh
If you don't see a command prompt, try pressing enter.
/ # nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
/ #
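Worth noting: busybox's nslookup is known to give misleading results against CoreDNS, so it helps to cross-check with a different client. A sketch, using the image from the Kubernetes DNS debugging docs (the image path is an assumption and may have moved):
$ kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --restart=Never -- sleep 3600
$ kubectl exec -it dnsutils -- nslookup kubernetes.default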
Here is what I get when I inspect the kind network:
$ docker network inspect kind
[
{
"Name": "kind",
"Id": "7d815ef0d0c4adc297aa523aa3336ba89bc6d7212373d3098f12169618c16563",
"Created": "2021-08-24T16:41:41.258730207-07:00",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": true,
"IPAM": {
"Driver": "default",
"Options": {},
"Config": [
{
"Subnet": "172.18.0.0/16",
"Gateway": "172.18.0.1"
},
{
"Subnet": "fc00:f853:ccd:e793::/64"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"1c47d1b38fe7b0b75e71c21c150aba4d5110ade54d74e2f3db45c5d15d013c59": {
"Name": "konvoy-capi-bootstrapper-control-plane",
"EndpointID": "4b176452133a1881380cae8b3fc55963ec0427ee809bc1b678d261f3c1711931",
"MacAddress": "02:42:ac:12:00:02",
"IPv4Address": "172.18.0.2/16",
"IPv6Address": "fc00:f853:ccd:e793::2/64"
}
},
"Options": {
"com.docker.network.bridge.enable_ip_masquerade": "true",
"com.docker.network.driver.mtu": "1454"
},
"Labels": {}
}
]
$ kind get nodes --name konvoy-capi-bootstrapper
konvoy-capi-bootstrapper-control-plane
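One thing that commonly breaks kind clusters after a Docker restart is the node container coming back with a different IP on the kind network. A quick way to compare (a sketch; the container name is taken from the inspect output above):
$ docker inspect -f '{{.NetworkSettings.Networks.kind.IPAddress}}' konvoy-capi-bootstrapper-control-plane
$ kubectl get nodes -o wide    # compare INTERNAL-IP with the address printed above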
Output from ip addr:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether 48:2a:e3:0a:7a:8c brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 30:24:32:43:a0:e9 brd ff:ff:ff:ff:ff:ff
inet 192.168.42.76/24 brd 192.168.42.255 scope global dynamic noprefixroute wlp2s0
valid_lft 83634sec preferred_lft 83634sec
inet6 fe80::c3e2:7427:34c8:c265/64 scope link noprefixroute
valid_lft forever preferred_lft forever
25: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1400 qdisc noqueue state DOWN group default
link/ether 02:42:0c:bc:be:aa brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
28: br-7d815ef0d0c4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue state UP group default
link/ether 02:42:08:aa:2f:bb brd ff:ff:ff:ff:ff:ff
inet 172.18.0.1/16 brd 172.18.255.255 scope global br-7d815ef0d0c4
valid_lft forever preferred_lft forever
inet6 fc00:f853:ccd:e793::1/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::42:8ff:feaa:2fbb/64 scope link
valid_lft forever preferred_lft forever
inet6 fe80::1/64 scope link
valid_lft forever preferred_lft forever
30: vethba7cc46@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue master br-7d815ef0d0c4 state UP group default
link/ether 82:3a:43:df:a0:c1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::803a:43ff:fedf:a0c1/64 scope link
valid_lft forever preferred_lft forever
Finally, logs from a CoreDNS pod:
35365->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:36799->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:55841->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:38716->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:51342->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:46009->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:33070->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:34194->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:56925->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:35681->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:42683->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:40842->172.18.0.1:53: i/o timeout
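Those timeouts are CoreDNS failing to reach its upstream resolver at 172.18.0.1, the gateway of the kind Docker network. A sketch of what can be checked from the host, assuming the node container name from the docker network inspect output above (getent may or may not be present in the node image):
$ docker exec konvoy-capi-bootstrapper-control-plane cat /etc/resolv.conf     # CoreDNS forwards to whatever is listed here
$ docker exec konvoy-capi-bootstrapper-control-plane getent hosts github.com  # does the node itself resolve external names?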
Hey, for us the same issue happens after stopping/rebooting Docker. The issue keeps reproducing on 2 different hosts @romansworks
Edit: we're running a single-node setup, with the following config (copied from the website):
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
kubeadmConfigPatches:
- |
kind: InitConfiguration
nodeRegistration:
kubeletExtraArgs:
node-labels: "ingress-ready=true"
extraPortMappings:
- containerPort: 80
hostPort: 80
protocol: TCP
- containerPort: 443
hostPort: 443
protocol: TCP
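(In case it matters for reproducing: the cluster is created from that file with something like the following; the filename is just an example.)
$ kind create cluster --config kind-config.yaml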
@AlmogBaku I still can't reproduce this in any of our environments. We need to know more about yours.
That usually happens after I've closed Docker a few times.
Both @RomansWorks and I are using macOS.
I have the same issue here in my dev environment... the weird thing is that when I connect to the pod using bash and try nslookup, DNS works, as you can see in the image below:

But when I try it from my application the name cannot be resolved and everything just doesn't work... and no error is returned (which is weird too)

However, if I use the pod IP it works normally...

My stack is:
- Docker 20.10.11
- K8s 1.21.1 (kindest/node default, but I already tested with all other supported versions)
- Kind 0.11.1 (single cluster)
NOTES:
- I created the kind cluster using the script with the local registry that can be found here https://kind.sigs.k8s.io/docs/user/local-registry/
@alexandresgf please don't use screenshots, those are hard to read.
Is this problem happening after a reboot, or did it never work?
@alexandresgf please don't use screenshots, those are hard to read.
Sorry for that!
Is this problem happening after a reboot, or did it never work?
At first it worked for a while, then suddenly it broke after a reboot and DNS never worked again, even after removing kind completely and doing a fresh install.
I have a similar problem. I created a local kind cluster and it was working fine over the entire weekend, but today, when I rebooted my PC, DNS was completely down. I tried restarting Docker, and even restarting the CoreDNS container manually, but that doesn't fix the issue.
I get errors like this in all my containers:
dial tcp: lookup notification-controller.flux-system.svc.cluster.local. on 10.96.0.10:53: read udp 10.244.0.3:52830->10.96.0.10:53: read: connection refused"
And it's not only the internal network. Even external requests are failing with the same error.
dial tcp: lookup github.com on 10.96.0.10:53: read udp 10.244.0.15:41035->10.96.0.10:53: read: connection refused'
Any idea?
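A couple of checks that may be worth running at that point (a sketch, not a confirmed fix; it assumes the default CoreDNS deployment in kube-system):
$ kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide   # are the CoreDNS pods actually Ready?
$ kubectl -n kube-system get endpoints kube-dns                 # connection refused on 10.96.0.10:53 often means this list is empty
$ kubectl -n kube-system rollout restart deployment coredns     # workaround some people try; it does not address the root cause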
I observe the same issues when using KinD in a WSL2/Windows 11 environment. Example logs from the CoreDNS pod:
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0202 14:14:20.711784 1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: network is unreachable
E0202 14:14:22.917864 1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: network is unreachable
pkg/mod/k8s.io/[email protected]
This is an old version; also, WSL2/Windows 11 environments had some known issues. Are you using the latest version?
This bug is starting to become a placeholder. I wonder if we should close it and open more specific bugs; a cluster not working after reboot on Windows is not the same issue as with podman, or with lima, ...
Hi @aojea, which component are you saying is outdated?
I'm using kind 0.17.0 and I created the cluster using the command kind create cluster --image kindest/node:v1.21.14@sha256:9d9eb5fb26b4fbc0c6d95fa8c790414f9750dd583f5d7cee45d92e8c26670aa1 which is listed as a supported image in the 0.17.0 release.
I don't believe any of the WSL2 known issues are related to this? They all seem to be related to Docker Desktop behaviour.
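If it helps pin down which component is old, the CoreDNS image actually running in the cluster can be checked with something like:
$ kubectl -n kube-system get deployment coredns -o jsonpath='{.spec.template.spec.containers[0].image}'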