
DNS not working after reboot

Open hobyte opened this issue 3 years ago • 8 comments

What happened: I created a new kind cluster, then rebooted my computer. After the reboot, DNS cannot resolve addresses.

What you expected to happen:

DNS can resolve addresses

How to reproduce it (as minimally and precisely as possible):

  • create a new kind cluster
  • test DNS: it's working
  • reboot your machine (don't stop Docker before the reboot)
  • test DNS again:
#APISERVER=https://kubernetes.default.svc
#SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
#NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
#TOKEN=$(cat ${SERVICEACCOUNT}/token)
#CACERT=${SERVICEACCOUNT}/ca.crt
#curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api
curl: (6) Could not resolve host: kubernetes.default.svc

Taken from https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/#without-using-a-proxy

Anything else we need to know?:

  • dns pods are running
  • dns logs:
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.0
linux/amd64, go1.15.3, 054c9ae

dns lookup:

#nslookup kubernetes.default
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1
#nslookup kubernetes.default.svc
;; connection timed out; no servers could be reached
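
As an aside, the successful lookup above is the search list at work: `kubernetes.default` is a relative name, so the resolver completes it with the domains from resolv.conf. A toy sketch of that expansion (not the real glibc resolver, which also honors `ndots` and actually queries each candidate):

```shell
#!/bin/sh
# Toy illustration of resolver search-list expansion: a relative name is
# tried against each search domain in turn; the real resolver then queries
# each candidate until one answers.
expand() {
  name=$1; shift
  for domain in "$@"; do
    printf '%s.%s\n' "$name" "$domain"
  done
}

# The search domains from the resolv.conf paste:
expand kubernetes.default \
  default.svc.cluster.local svc.cluster.local cluster.local
```

The second candidate, kubernetes.default.svc.cluster.local, is exactly the name nslookup reports back.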

resolv.conf:

#cat /etc/resolv.conf 
search default.svc.cluster.local svc.cluster.local cluster.local fritz.box
nameserver 10.96.

Environment:

  • kind version: (use kind version): kind v0.11.1 go1.16.4 linux/amd64
  • Kubernetes version: (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 5
  Running: 2
  Paused: 0
  Stopped: 3
 Images: 11
 Server Version: 20.10.6-ce
 Storage Driver: btrfs
  Build Version: Btrfs v4.15
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: oci runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version:
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.3.18-59.16-default
 Operating System: openSUSE Leap 15.3
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.552GiB
 Name: Proxima-Centauri
 ID: M6J5:OLHQ:FXVM:M7WG:2OUA:SKGW:UCF5:DWJZ:4M7T:YA2W:6FBT:DOLG
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

  • OS (e.g. from /etc/os-release):
    NAME="openSUSE Leap"
    VERSION="15.3"
    ID="opensuse-leap"
    ID_LIKE="suse opensuse"
    VERSION_ID="15.3"
    PRETTY_NAME="openSUSE Leap 15.3"
    ANSI_COLOR="0;32"
    CPE_NAME="cpe:/o:opensuse:leap:15.3"
    BUG_REPORT_URL="https://bugs.opensuse.org"
    HOME_URL="https://www.opensuse.org/"

hobyte avatar Jul 23 '21 08:07 hobyte

I assume this snippet was a copy-paste error; it is missing the last digits of the IP address:

#cat /etc/resolv.conf 
search default.svc.cluster.local svc.cluster.local cluster.local fritz.box
nameserver 10.96.

Are you using one node or multiple nodes in the cluster? Clusters with multiple nodes don't handle reboots.
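
For what it's worth, a truncated entry like that can be caught mechanically; a small sketch with a hypothetical `check_nameservers` helper (it only understands dotted-quad IPv4, so it would also flag legitimate IPv6 entries):

```shell
#!/bin/sh
# Hypothetical helper: flag any "nameserver" entry in a resolv.conf-style
# file that is not a complete dotted-quad IPv4 address (e.g. a paste
# truncated at "10.96."). Exits non-zero if a suspicious entry is found.
check_nameservers() {
  awk '$1 == "nameserver" && $2 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
    print "suspicious entry: " $2
    bad = 1
  }
  END { exit bad }' "$1"
}
```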

aojea avatar Jul 27 '21 06:07 aojea

Hi, I'm also running into this issue! Although I'm not sure it's necessarily caused by a restart in my case.

$  kubectl run -it --rm --restart=Never busybox1 --image=busybox sh
If you don't see a command prompt, try pressing enter.
/ # nslookup kubernetes.default
Server:		10.96.0.10
Address:	10.96.0.10:53

** server can't find kubernetes.default: NXDOMAIN

*** Can't find kubernetes.default: No answer

/ # 

Here is what I get when I inspect the kind network

$ docker network inspect kind
[
    {
        "Name": "kind",
        "Id": "7d815ef0d0c4adc297aa523aa3336ba89bc6d7212373d3098f12169618c16563",
        "Created": "2021-08-24T16:41:41.258730207-07:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": true,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                },
                {
                    "Subnet": "fc00:f853:ccd:e793::/64"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "1c47d1b38fe7b0b75e71c21c150aba4d5110ade54d74e2f3db45c5d15d013c59": {
                "Name": "konvoy-capi-bootstrapper-control-plane",
                "EndpointID": "4b176452133a1881380cae8b3fc55963ec0427ee809bc1b678d261f3c1711931",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": "fc00:f853:ccd:e793::2/64"
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.driver.mtu": "1454"
        },
        "Labels": {}
    }
]
$ kind get nodes --name konvoy-capi-bootstrapper
konvoy-capi-bootstrapper-control-plane
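
As an aside, the subnet doesn't have to be eyeballed out of that JSON. A grep-based sketch over a saved dump (live, `docker network inspect kind -f '{{(index .IPAM.Config 0).Subnet}}'` should print the same value directly):

```shell
#!/bin/sh
# Pull the first IPv4 subnet out of a saved `docker network inspect kind`
# dump; assumes the pretty-printed JSON layout shown above.
kind_subnet() {
  grep -oE '"Subnet": "[0-9]+(\.[0-9]+){3}/[0-9]+"' "$1" |
    head -n1 | cut -d'"' -f4
}
```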

output from ip addr

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 48:2a:e3:0a:7a:8c brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 30:24:32:43:a0:e9 brd ff:ff:ff:ff:ff:ff
    inet 192.168.42.76/24 brd 192.168.42.255 scope global dynamic noprefixroute wlp2s0
       valid_lft 83634sec preferred_lft 83634sec
    inet6 fe80::c3e2:7427:34c8:c265/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
25: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1400 qdisc noqueue state DOWN group default 
    link/ether 02:42:0c:bc:be:aa brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
28: br-7d815ef0d0c4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue state UP group default 
    link/ether 02:42:08:aa:2f:bb brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-7d815ef0d0c4
       valid_lft forever preferred_lft forever
    inet6 fc00:f853:ccd:e793::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::42:8ff:feaa:2fbb/64 scope link 
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link 
       valid_lft forever preferred_lft forever
30: vethba7cc46@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue master br-7d815ef0d0c4 state UP group default 
    link/ether 82:3a:43:df:a0:c1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::803a:43ff:fedf:a0c1/64 scope link 
       valid_lft forever preferred_lft forever

finally logs from a coredns pod

35365->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:36799->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:55841->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:38716->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:51342->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:46009->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:33070->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:34194->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:56925->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:35681->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:42683->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:40842->172.18.0.1:53: i/o timeout
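
Those timeouts all name the same upstream; a quick sketch to confirm that against a saved log file (live, the logs would come from something like `kubectl -n kube-system logs -l k8s-app=kube-dns`, assuming the default CoreDNS labels):

```shell
#!/bin/sh
# List the distinct upstream resolvers that CoreDNS reports i/o timeouts
# against in a saved log file.
timeout_upstreams() {
  grep 'i/o timeout' "$1" |
    grep -oE '>[0-9.]+:[0-9]+' |
    sed 's/^>//' |
    sort -u
}
```

Here everything points at 172.18.0.1:53, the docker bridge gateway, which is consistent with the node losing its path to the host-side resolver.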

faiq avatar Aug 25 '21 00:08 faiq

Hey, for us the same issue happens after stopping/rebooting Docker. It keeps reproducing on 2 different hosts @romansworks

Edit: we're running a single node setup, with the following config (copied from the website):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP

AlmogBaku avatar Nov 06 '21 23:11 AlmogBaku

@AlmogBaku I still can't reproduce this in any of our environments. We need to know more about yours.

BenTheElder avatar Nov 08 '21 15:11 BenTheElder

It usually happens after I've closed Docker a few times.

Both me and @RomansWorks are using macOS

AlmogBaku avatar Nov 08 '21 18:11 AlmogBaku

I have the same issue in my dev environment... the weird thing is that when I connect to the pod using bash and try nslookup, DNS works, as you can see in the image below:

image

But when I try it from my application, the name cannot be resolved and nothing works... and no error is returned (that is weird too)

image

However, if I use the pod IP it works normally...

image

My stack is:

  • Docker 20.10.11
  • K8s 1.21.1 (kindest/node default, but I already tested with all other supported versions)
  • Kind 0.11.1 (single cluster)

NOTES:

  • I created the kind cluster using the script with the local registry that can be found here https://kind.sigs.k8s.io/docs/user/local-registry/

alexandresgf avatar Dec 08 '21 18:12 alexandresgf

@alexandresgf please don't use screenshots; they are hard to read.

Is this problem happening after a reboot, or did it never work?

aojea avatar Dec 08 '21 18:12 aojea

@alexandresgf please don't use screenshots; they are hard to read.

Sorry for that!

Is this problem happening after a reboot, or did it never work?

At first it worked for a while, then suddenly it happened after a reboot, and DNS never worked again, even after removing kind completely and doing a fresh install.

alexandresgf avatar Dec 10 '21 14:12 alexandresgf

I have a similar problem. I created a local kind cluster and it was working fine during the entire weekend, but today, when I rebooted my PC, DNS is completely down. I tried restarting Docker, and even manually restarting the CoreDNS container, but that doesn't fix the issue.

I got errors like this all over my containers:

 dial tcp: lookup notification-controller.flux-system.svc.cluster.local. on 10.96.0.10:53: read udp 10.244.0.3:52830->10.96.0.10:53: read: connection refused"

And it's not only the internal network. Even external requests are failing with the same error.

dial tcp: lookup github.com on 10.96.0.10:53: read udp 10.244.0.15:41035->10.96.0.10:53: read: connection refused'

Any idea?

brpaz avatar Oct 17 '22 20:10 brpaz

I observe the same issues when using KinD in a WSL2/Windows 11 environment. Example logs from the CoreDNS pod:

[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0202 14:14:20.711784       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: network is unreachable
E0202 14:14:22.917864       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: network is unreachable

ben-foxmoore avatar Feb 02 '23 14:02 ben-foxmoore

pkg/mod/k8s.io/[email protected]

This is an old version; also, WSL2/Windows 11 environments had some known issues. Are you using the latest version?

This bug is starting to become a placeholder; I wonder if we should close it and open more specific bugs. A cluster not working after reboot on Windows is not the same problem as with podman, or with lima, ...

aojea avatar Feb 02 '23 15:02 aojea

Hi @aojea, which component are you saying is outdated?

I'm using kind 0.17.0 and I created the cluster using the command kind create cluster --image kindest/node:v1.21.14@sha256:9d9eb5fb26b4fbc0c6d95fa8c790414f9750dd583f5d7cee45d92e8c26670aa1 which is listed as a supported image in the 0.17.0 release.

I don't believe any of the WSL2 known issues are related to this? They all seem to be related to Docker Desktop behaviour.

ben-foxmoore avatar Feb 02 '23 16:02 ben-foxmoore