CoreDNS does not start after host reboot

Open • etm-de opened this issue 10 months ago • 1 comment

Summary

I am running microk8s on an Ubuntu Amazon Workspace. I can get CoreDNS to start once, but if I reboot the machine, the CoreDNS pod no longer starts. I suspect this may be peculiar to Amazon Workspaces. I'm hoping you can give me tips to debug the problem.

Note that I have ha-cluster disabled in the examples below, but I have seen the same thing with it enabled.

$ microk8s kubectl -n kube-system get po
NAME                       READY   STATUS    RESTARTS      AGE
coredns-79b94494c7-qqqzn   0/1     Running   1 (28m ago)   33m

Here are the CoreDNS logs:

[INFO] 127.0.0.1:46655 - 23633 "HINFO IN 8185052869104737298.5565547917880098493. udp 57 false 512" - - 0 2.00032851s
[ERROR] plugin/errors: 2 8185052869104737298.5565547917880098493. HINFO: read udp 10.1.94.8:44798->172.31.254.165:53: i/o timeout
[INFO] 127.0.0.1:44082 - 18542 "HINFO IN 8185052869104737298.5565547917880098493. udp 57 false 512" - - 0 2.001442862s
[ERROR] plugin/errors: 2 8185052869104737298.5565547917880098493. HINFO: read udp 10.1.94.8:59424->172.31.253.215:53: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.152.183.1:443/version": dial tcp 10.152.183.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
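
To see at a glance which destinations are timing out in logs like the ones above, a small filter can help. This is only a sketch: the function name timed_out_destinations is made up for illustration, and the usage line assumes the CoreDNS pods carry the label k8s-app=kube-dns, which may differ on your install.

```shell
# Hypothetical helper: print each distinct IP:port that appears in an
# "i/o timeout" line, covering both "a->b" UDP reads and "dial tcp b" errors.
timed_out_destinations() {
  grep 'i/o timeout' \
    | grep -oE '(->|dial tcp )[0-9]+(\.[0-9]+){3}:[0-9]+' \
    | sed -E 's/^(->|dial tcp )//' \
    | sort -u
}

# Usage (assumes the label k8s-app=kube-dns on the CoreDNS pods):
# microk8s kubectl -n kube-system logs -l k8s-app=kube-dns | timed_out_destinations
```

Run against the log lines above, this lists the two upstream resolvers on port 53 and the API service address 10.152.183.1:443, which suggests that no traffic is leaving the pod at all, rather than DNS alone being broken.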

I also see

$ microk8s kubectl get svc -n kube-system
NAME       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.152.183.10   <none>        53/UDP,53/TCP,9153/TCP   5m57s

$ microk8s kubectl  get ep -n kube-system
NAME       ENDPOINTS   AGE
kube-dns               4m48s
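
The empty ENDPOINTS column above is consistent with the pod being 0/1 ready, since a pod that is not ready is not listed as an endpoint. As a rough sketch, a filter like the following lists every service in that state. The function name services_without_endpoints is hypothetical, and it assumes the tabular output format shown above, where an empty column leaves only two fields on the line.

```shell
# Hypothetical helper: read `kubectl get ep` table output on stdin and print
# the names of entries whose ENDPOINTS column is blank or "<none>".
services_without_endpoints() {
  awk 'NR > 1 && (NF == 2 || $2 == "<none>") { print $1 }'
}

# Usage:
# microk8s kubectl get ep -n kube-system | services_without_endpoints
```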

I have tried

microk8s kubectl -n kube-system rollout restart deploy

and re-enabling the DNS addon, but neither resolved the issue. I also tried

sudo ufw allow in on vxlan.calico && sudo ufw allow out on vxlan.calico
sudo ufw allow in on cali+ && sudo ufw allow out on cali+

and

sudo ufw default allow routed

That did not seem to help.

What Should Happen Instead?

I should be able to restart my machine and have CoreDNS continue to run.

Reproduction Steps

1. Install microk8s: sudo snap install microk8s --classic
2. Check that the CoreDNS pod is running: microk8s kubectl -n kube-system get po
3. Reboot the machine
4. The CoreDNS pod is no longer running: microk8s kubectl -n kube-system get po

Introspection Report

Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-flanneld is running
  Service snap.microk8s.daemon-etcd is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster

I did not attach the tarball because I do not know enough about the security implications.

Can you suggest a fix?

Are you interested in contributing with a fix?

Feb 19 '25 17:02 etm-de

Hi @etm-de,

Unfortunately, I don't know much about Amazon Workspaces. Do they use AMIs? If so, are you deploying on an Ubuntu AMI or an Amazon Linux one? Can you check the output of journalctl -u snap.microk8s.daemon-kubelite for anything interesting?

I tried reproducing your problem using LXD containers but everything was fine after reboot.
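
If the kubelite journal is long, something along these lines might narrow it down. It is only a sketch: the function name interesting_lines and the keyword list are arbitrary choices for illustration, not something MicroK8s provides.

```shell
# Hypothetical helper: keep only lines that mention common failure keywords
# (case-insensitive). Adjust the keyword list as needed.
interesting_lines() {
  grep -iE 'error|fail|timeout|refused|denied'
}

# Usage: -b limits the journal to the current boot, i.e. since the reboot.
# journalctl -u snap.microk8s.daemon-kubelite -b --no-pager | interesting_lines
```

Limiting the output to the current boot keeps the focus on what changed after the restart.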

Feb 21 '25 22:02 eaudetcobello