kubespray
Pods failing after restart of VM
Environment:
- Hardware configuration:
- OS: Ubuntu 20.04.4 LTS
- Version of Ansible: ansible 2.10.15
- Version of Python3: Python 3.8.10
Kubespray version (commit): 2cc5f04b
Full inventory with variables:
all:
  hosts:
    node1:
      ansible_host: 192.168.2.211
      ip: 192.168.2.211
      access_ip: 192.168.2.211
    node2:
      ansible_host: 192.168.2.212
      ip: 192.168.2.212
      access_ip: 192.168.2.212
    node3:
      ansible_host: 192.168.2.213
      ip: 192.168.2.213
      access_ip: 192.168.2.213
    node4:
      ansible_host: 192.168.2.214
      ip: 192.168.2.214
      access_ip: 192.168.2.214
  children:
    kube_control_plane:
      hosts:
        node1:
        node2:
    kube_node:
      hosts:
        node1:
        node2:
        node3:
        node4:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
Command used to invoke ansible:
ansible-playbook -i inventory/newCluster/hosts.yaml --become --become-user=root cluster.yml
Output of ansible run:
Anything else we need to know: After I rebooted the VM where the master k8s node was installed, none of the pods can come up because of this error: "Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice""
Hi @dimakyriakov, I tried it on Ubuntu 20.04.4 LTS and everything is OK. Per https://github.com/kubernetes/minikube/issues/5223, would you please give me more information about Docker's cgroupdriver config and the kubelet's cgroupdriver config?
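For anyone gathering the same information, a minimal sketch of how both cgroup drivers can be checked; the kubelet file paths below are the usual Kubespray locations and are an assumption for this particular setup:
# Docker's cgroup driver as reported by the daemon
docker info --format '{{ .CgroupDriver }}'
# kubelet's cgroup driver (kubespray templates the kubelet config/env files here)
grep -i cgroupdriver /etc/kubernetes/kubelet-config.yaml /var/lib/kubelet/config.yaml 2>/dev/null
grep -i cgroup /etc/kubernetes/kubelet.env 2>/dev/null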
Hello, @yankay
A piece of the k8s-cluster.yml file:
docker-options.conf:
I don't have a daemon.json by default.
The kubelet is down after reboot:
kubelet.env:
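For context on the missing daemon.json: below is a sketch of what an explicit systemd cgroup-driver setting would look like if it were configured there. Kubespray normally injects the equivalent flag through the docker-options.conf drop-in mentioned above rather than daemon.json, so its absence is expected; this is illustration only, not a recommended change:
# sketch: /etc/docker/daemon.json pinning Docker to the systemd cgroup driver
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker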
Same problem here on a set of Ubuntu 20.04 desktop VMs with kubespray commit c24a3a3b152d41f88bd48c9e6f24fd132fd4a78a and kube version 1.24.3.
The install went fine with the following command on a setup with a single master node and 2 worker nodes.
ansible-playbook -T 20 -i inventory/local/hosts.yml cluster.yml -v -e container_manager=docker
Reboot is CHAOS!
- I was forced to apply these tweaks on the worker nodes for them to restart properly:
sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
# after each reboot DNS may be broken. If 192.168.122.1 was your DNS:
sudo systemd-resolve --interface enp1s0 --set-dns 192.168.122.1   # on newer systemd: resolvectl dns enp1s0 192.168.122.1
# if exactly one /etc/kubernetes/<conf>.* copy exists, back up the live conf and restore from that copy
for f in admin.conf controller-manager.conf kubelet.conf scheduler.conf ; do
  sudo [ `ls /etc/kubernetes/$f.* 2>/dev/null | wc -l` -eq 1 ] && sudo cp /etc/kubernetes/$f /etc/kubernetes/bk-$f && sudo cp /etc/kubernetes/$f.* /etc/kubernetes/$f
done
- I cannot restart my single master node due to this issue with the docker slice:
[root@k8s-master-1]> docker start k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0
Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice"
Error: failed to start containers: k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0
journalctl -r -u kubelet
-- Logs begin at Mon 2022-07-25 15:33:06 CEST, end at Thu 2022-07-28 10:23:22 CEST. --
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: I0728 10:23:22.157936 1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: E0728 10:23:22.146025 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: E0728 10:23:22.045172 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.968052 1938 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://127.0.0.1:6443/api/v1/nodes\": dial tcp 127.0.0>
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967381 1938 kubelet_node_status.go:70] "Attempting to register node" node="k8s-master-1"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967241 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientPID"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967069 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasNoDiskPressure"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.966780 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientMemory"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.957578 1938 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.944303 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.842241 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.740667 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.640006 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.539524 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.438500 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.337455 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.236156 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.157571 1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.135497 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.035193 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.934746 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.833561 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.732640 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.631545 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.530963 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.430640 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.329996 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299770 1938 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-k8s-master-1_kube-sy>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299729 1938 kuberuntime_manager.go:815] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create a sandbox for pod \>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299700 1938 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create a sandbox for pod \>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299635 1938 remote_runtime.go:212] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create a sandbox for >
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295429 1938 kuberuntime_manager.go:488] "No ready sandbox for pod can be found. Need to start a new one" pod="kube-system/kube-controller-manager-k8>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295032 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientPID"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295019 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasNoDiskPressure"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.294948 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientMemory"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.285246 1938 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.228558 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.158092 1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.127582 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
I think it may be caused by the fact that cgroup v2 is disabled!
ll /sys/fs/cgroup/cgroup.controllers
ls: cannot access '/sys/fs/cgroup/cgroup.controllers': No such file or directory
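Another quick way to confirm which hierarchy a host is running (a sketch; "cgroup2fs" means the unified v2 hierarchy, "tmpfs" means the legacy/hybrid one):
# print the filesystem type mounted at /sys/fs/cgroup
stat -fc %T /sys/fs/cgroup/
# or check the mount table directly
mount | grep -w cgroup2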
etcd is running fine.
Enabling cgroup v2 makes the reboot of master nodes possible... :( Prerequisite: make sure cgroup v2 is enabled on the hosts:
# if this file exists, cgroup v2 is already enabled
ll /sys/fs/cgroup/cgroup.controllers
# enable it (note: this sed only matches an empty GRUB_CMDLINE_LINUX; otherwise append the flag by hand):
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX=
sudo sed -i -e 's/^GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"/' /etc/default/grub
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX=
sudo update-grub
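A reboot is needed before the new kernel command line takes effect. A short sketch of the follow-up check (the CgroupVersion field assumes Docker 20.10+):
sudo reboot
# after the node comes back:
ls -l /sys/fs/cgroup/cgroup.controllers        # should now exist
docker info --format '{{ .CgroupVersion }}'    # should print 2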
The Kubernetes cluster now reboots, but I still have coredns pods that do not restart:
2022/07/28 12:26:10 [INFO] Skipping kube-dns configmap sync as no directory was specified
.:53 on 169.254.25.10
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
[FATAL] plugin/loop: Loop (169.254.25.10:59357 -> 169.254.25.10:53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 4659600271498259777.1850538492869913665."
stream closed
Logs of coredns on the worker node that did not restart show:
2022/07/28 10:52:52 [INFO] Skipping kube-dns configmap sync as no directory was specified
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
CoreDNS-1.7.0
I still need these tweaks on all nodes:
sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
The issue with coredns seems to be linked to https://github.com/kubernetes-sigs/kubespray/issues/5835
As @k8s-ci-robot mentioned, Ubuntu 20.04 support is merged in master. In addition to the coredns loop error, there are a couple of other things that can come up depending on your environment (a sketch of how to apply them follows below):
- IPVS mode is not supported with the KVM kernel in Ubuntu 20.04
- Mitogen + Ubuntu 20.04 requires specifying the python interpreter path
- enable_nodelocaldns: false is required in some instances
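For reference, a minimal sketch of how those suggestions could be applied to the install above. Setting them in the inventory group_vars is the usual approach; passing them as extra vars also works (the inventory path and the container_manager override are the ones used earlier in this thread):
ansible-playbook -T 20 -i inventory/local/hosts.yml cluster.yml -v \
  -e container_manager=docker \
  -e kube_proxy_mode=iptables \
  -e enable_nodelocaldns=false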
I use KVM...
enable_nodelocaldns: false
does not solve the issue with the coredns crash loop.
The crash loop now occurs only on the master nodes and not on every node.
Using iptables instead of ipvs does not solve the coredns crash loop after reboot.
From the https://coredns.io/plugins/loop/#troubleshooting link, it seems that disabling systemd-resolved is worth a shot...
Hooray, it works!
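For anyone hitting the same loop, roughly what that change amounts to on each affected node (a sketch only; 192.168.122.1 stands in for whatever upstream resolver the host should use, as mentioned earlier in this thread, and the coredns deployment name is the one a default kubespray install creates):
# stop systemd-resolved so /etc/resolv.conf no longer points at the local stub resolver
sudo systemctl disable --now systemd-resolved
# give the host a real upstream resolver instead
sudo rm /etc/resolv.conf
echo "nameserver 192.168.122.1" | sudo tee /etc/resolv.conf
# restart the crash-looping DNS pods so they pick up the new host resolv.conf
kubectl -n kube-system rollout restart deployment/coredns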
Great that you found a way to fix your coredns issue 👍
@floryut, do you know if the test suite includes a simple restart of the cluster?
They do not; it would be possible to add one, though. But come to think of your issue, it is strange that you have to disable systemd-resolved: to have coredns work, pointing the resolv.conf to /run/systemd/resolve/resolv.conf
was enough to fix the coredns issue on Ubuntu, AFAIK.
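On the host side, that alternative is roughly the following (a sketch; it keeps systemd-resolved running but hands the kubelet and containers the full resolver list instead of the 127.0.0.53 stub):
# replace the stub-managed resolv.conf with the one listing the real upstream servers
sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf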
The issues with coredns appeared only after restarting. The initial install went fine.
I'll try to spin up an Ubuntu cluster to see if I can reproduce this behavior, but as you're the first to report it, it would be strange if this were a bug in our codebase.
I had a similar problem. This seems to be related to #3979.
Same problem here on Debian.
sudo systemctl disable systemd-resolved
does not solve the issue...
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.