kubespray
Pods failing after restart of VM
Environment:
- Hardware configuration:
- OS: Ubuntu 20.04.4 LTS
- Version of Ansible: ansible 2.10.15
- Version of Python3: Python 3.8.10
Kubespray version (commit): 2cc5f04b
Full inventory with variables:
all:
  hosts:
    node1:
      ansible_host: 192.168.2.211
      ip: 192.168.2.211
      access_ip: 192.168.2.211
    node2:
      ansible_host: 192.168.2.212
      ip: 192.168.2.212
      access_ip: 192.168.2.212
    node3:
      ansible_host: 192.168.2.213
      ip: 192.168.2.213
      access_ip: 192.168.2.213
    node4:
      ansible_host: 192.168.2.214
      ip: 192.168.2.214
      access_ip: 192.168.2.214
  children:
    kube_control_plane:
      hosts:
        node1:
        node2:
    kube_node:
      hosts:
        node1:
        node2:
        node3:
        node4:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
Command used to invoke ansible:
ansible-playbook -i inventory/newCluster/hosts.yaml --become --become-user=root cluster.yml
Output of ansible run:
Anything else we need to know: After I rebooted the VM where the master k8s node was installed, none of the pods can come up because of this error: "Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice""
Hi @dimakyriakov, I tried it on Ubuntu 20.04.4 LTS and everything is OK. Per https://github.com/kubernetes/minikube/issues/5223, would you please give me more information about Docker's cgroupdriver config and the kubelet's cgroupdriver config?
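For anyone gathering the same information, a minimal sketch of how both cgroup drivers can be checked; the kubelet file paths below are the usual Kubespray locations and are an assumption for this particular setup:
# Docker's cgroup driver as reported by the daemon
docker info --format '{{ .CgroupDriver }}'
# kubelet's cgroup driver (kubespray templates the kubelet config/env files here)
grep -i cgroupdriver /etc/kubernetes/kubelet-config.yaml /var/lib/kubelet/config.yaml 2>/dev/null
grep -i cgroup /etc/kubernetes/kubelet.env 2>/dev/null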
Hello, @yankay
A piece of the k8s-cluster.yml file:
docker-options.conf:
I don't have a daemon.json by default.
The kubelet is down after reboot:
kubelet.env:
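For context on the missing daemon.json: below is a sketch of what an explicit systemd cgroup-driver setting would look like if it were configured there. Kubespray normally injects the equivalent flag through the docker-options.conf drop-in mentioned above rather than daemon.json, so its absence is expected; this is illustration only, not a recommended change:
# sketch: /etc/docker/daemon.json pinning Docker to the systemd cgroup driver
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker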
Same problem here on a set of Ubuntu 20.04 desktop VMs with kubespray commit c24a3a3b152d41f88bd48c9e6f24fd132fd4a78a and kube version 1.24.3.
The install went fine with the following command on a setup with a single master node and 2 worker nodes.
ansible-playbook -T 20 -i inventory/local/hosts.yml cluster.yml -v -e container_manager=docker
Reboot is CHAOS!
- I was forced to apply these tweaks on the worker nodes for them to restart properly:
sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
# after each reboot DNS may be broken. If 192.168.122.1 was your DNS:
sudo systemd-resolve --interface enp1s0 --set-dns 192.168.122.1   # on newer systemd: resolvectl dns enp1s0 192.168.122.1
# if exactly one /etc/kubernetes/<conf>.* copy exists, back up the live conf and restore from that copy
for f in admin.conf controller-manager.conf kubelet.conf scheduler.conf ; do
  sudo [ `ls /etc/kubernetes/$f.* 2>/dev/null | wc -l` -eq 1 ] && sudo cp /etc/kubernetes/$f /etc/kubernetes/bk-$f && sudo cp /etc/kubernetes/$f.* /etc/kubernetes/$f
done
- I cannot restart my single master node due to this issue with the docker slice:
[root@k8s-master-1]> docker start k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0
Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice"
Error: failed to start containers: k8s_kube-apiserver_kube-apiserver-k8s-master-1_kube-system_b4449e3724891f3d586ad7a5f50b28b5_0
journalctl -r -u kubelet
-- Logs begin at Mon 2022-07-25 15:33:06 CEST, end at Thu 2022-07-28 10:23:22 CEST. --
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: I0728 10:23:22.157936 1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: E0728 10:23:22.146025 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:22 k8s-master-1 kubelet[1938]: E0728 10:23:22.045172 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.968052 1938 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://127.0.0.1:6443/api/v1/nodes\": dial tcp 127.0.0>
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967381 1938 kubelet_node_status.go:70] "Attempting to register node" node="k8s-master-1"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967241 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientPID"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.967069 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasNoDiskPressure"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.966780 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientMemory"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.957578 1938 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.944303 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.842241 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.740667 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.640006 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.539524 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.438500 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.337455 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.236156 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: I0728 10:23:21.157571 1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.135497 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:21 k8s-master-1 kubelet[1938]: E0728 10:23:21.035193 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.934746 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.833561 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.732640 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.631545 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.530963 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.430640 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.329996 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299770 1938 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-k8s-master-1_kube-sy>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299729 1938 kuberuntime_manager.go:815] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create a sandbox for pod \>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299700 1938 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create a sandbox for pod \>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.299635 1938 remote_runtime.go:212] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create a sandbox for >
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295429 1938 kuberuntime_manager.go:488] "No ready sandbox for pod can be found. Need to start a new one" pod="kube-system/kube-controller-manager-k8>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295032 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientPID"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.295019 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasNoDiskPressure"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.294948 1938 kubelet_node_status.go:563] "Recording event message for node" node="k8s-master-1" event="NodeHasSufficientMemory"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.285246 1938 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.228558 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: I0728 10:23:20.158092 1938 csi_plugin.go:1063] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1>
juil. 28 10:23:20 k8s-master-1 kubelet[1938]: E0728 10:23:20.127582 1938 kubelet.go:2424] "Error getting node" err="node \"k8s-master-1\" not found"
I think it may be caused by the fact that cgroup v2 is disabled!
ll /sys/fs/cgroup/cgroup.controllers
ls: cannot access '/sys/fs/cgroup/cgroup.controllers': No such file or directory
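Another quick way to confirm which hierarchy a host is running (a sketch; "cgroup2fs" means the unified v2 hierarchy, "tmpfs" means the legacy/hybrid one):
# print the filesystem type mounted at /sys/fs/cgroup
stat -fc %T /sys/fs/cgroup/
# or check the mount table directly
mount | grep -w cgroup2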
etcd is running fine.
Enabling cgroup v2 makes the reboot of master nodes possible... :( Prerequisite: make sure cgroup v2 is enabled on the hosts:
# if this file exists, cgroup v2 is already enabled
ll /sys/fs/cgroup/cgroup.controllers
# enable it (note: this sed only matches an empty GRUB_CMDLINE_LINUX; otherwise append the flag by hand):
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX=
sudo sed -i -e 's/^GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"/' /etc/default/grub
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX=
sudo update-grub
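A reboot is needed before the new kernel command line takes effect. A short sketch of the follow-up check (the CgroupVersion field assumes Docker 20.10+):
sudo reboot
# after the node comes back:
ls -l /sys/fs/cgroup/cgroup.controllers        # should now exist
docker info --format '{{ .CgroupVersion }}'    # should print 2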
The Kubernetes cluster now reboots, but I still have coredns pods that do not restart:
2022/07/28 12:26:10 [INFO] Skipping kube-dns configmap sync as no directory was specified
.:53 on 169.254.25.10
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
[FATAL] plugin/loop: Loop (169.254.25.10:59357 -> 169.254.25.10:53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 4659600271498259777.1850538492869913665."
stream closed
Logs of coredns on the worker node that did not restart show:
2022/07/28 10:52:52 [INFO] Skipping kube-dns configmap sync as no directory was specified
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
CoreDNS-1.7.0
I still need these tweaks on all nodes:
sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
The issue with coredns seems to be linked to https://github.com/kubernetes-sigs/kubespray/issues/5835
As @k8s-ci-robot mentioned, Ubuntu 20.04 support is merged in master. In addition to the coredns loop error, there are a couple of other things that can come up depending on your environment (a sketch of how to apply them follows below):
- IPVS mode is not supported with the KVM kernel in Ubuntu 20.04
- Mitogen + Ubuntu 20.04 requires specifying the python interpreter path
- enable_nodelocaldns: false is required in some instances
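For reference, a minimal sketch of how those suggestions could be applied to the install above. Setting them in the inventory group_vars is the usual approach; passing them as extra vars also works (the inventory path and the container_manager override are the ones used earlier in this thread):
ansible-playbook -T 20 -i inventory/local/hosts.yml cluster.yml -v \
  -e container_manager=docker \
  -e kube_proxy_mode=iptables \
  -e enable_nodelocaldns=false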
I use KVM...
enable_nodelocaldns: false
does not solve the issue with the coredns crash loop.
The crash loop now occurs only on the master nodes and not on every node.
Using iptables instead of ipvs does not solve the coredns crash loop after reboot.
From the https://coredns.io/plugins/loop/#troubleshooting link, it seems that disabling systemd-resolved is worth a shot...
Hooray, it works!
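For anyone hitting the same loop, roughly what that change amounts to on each affected node (a sketch only; 192.168.122.1 stands in for whatever upstream resolver the host should use, as mentioned earlier in this thread, and the coredns deployment name is the one a default kubespray install creates):
# stop systemd-resolved so /etc/resolv.conf no longer points at the local stub resolver
sudo systemctl disable --now systemd-resolved
# give the host a real upstream resolver instead
sudo rm /etc/resolv.conf
echo "nameserver 192.168.122.1" | sudo tee /etc/resolv.conf
# restart the crash-looping DNS pods so they pick up the new host resolv.conf
kubectl -n kube-system rollout restart deployment/coredns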
Great that you found a way to fix your coredns issue 👍
@floryut, do you know if the test suite includes a simple restart of the cluster?
They do not; it would be possible to add one, though. But come to think of your issue, it is strange that you have to disable systemd-resolved: to have coredns work, pointing the resolv.conf to /run/systemd/resolve/resolv.conf
was enough to fix the coredns issue on Ubuntu, AFAIK.
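On the host side, that alternative is roughly the following (a sketch; it keeps systemd-resolved running but hands the kubelet and containers the full resolver list instead of the 127.0.0.53 stub):
# replace the stub-managed resolv.conf with the one listing the real upstream servers
sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf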
The issues with coredns appeared only after restarting. The initial install went fine.
I'll try to spin up an Ubuntu cluster to see if I can reproduce this behavior, but as you're the first to report it, it would be strange if this were a bug in our codebase.
I had a similar problem. This seems to be related to #3979.
Same problem here on Debian.
sudo systemctl disable systemd-resolved
does not solve the issue...
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.