Ansible playbook failing to add RHEL 8 DGX Node in K8s cluster
Issue: The Ansible playbook fails to add a RHEL 8 DGX node to the K8s cluster, and the kubelet service on that node keeps crashing.
Issue Details: One of the RHEL 7 DGX nodes in the cluster failed due to a hardware issue. We replaced the failed physical DGX node and installed RHEL 8 on the replacement, but when we try to add the rebuilt node to the cluster, the Ansible playbook fails with a "failed to reload cni configuration" error. We suspect that CNI is not installed properly.
DeepOps Version: release-22.04
Kubernetes Version: K8s v23.5
Error: containerd logs on the node:
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.694389943-07:00" level=info msg="containerd successfully boote>
Aug 11 00:12:59 dgxg20.example.com systemd[1]: Started containerd container runtime.
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714280076-07:00" level=info msg="Start event monitor"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714326620-07:00" level=info msg="Start snapshots syncer"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714338153-07:00" level=info msg="Start cni network conf syncer"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714345598-07:00" level=info msg="Start streaming server"
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.482783309-07:00" level=error msg="failed to reload cni configu>
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.483408806-07:00" level=error msg="failed to reload cni configu>
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.483531403-07:00" level=error msg="failed to reload cni configu
Kubelet service not starting:
$ systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet Server
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2024-08-12 06:01:56 PDT; 5s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 1416640 ExecStart=/usr/local/bin/kubelet $KUBE_LOGTOSTDERR $KUBE_LOG_LEVEL $KUBELET_API_SERVER $KUBELET_ADDRESS $KUBELET_PORT $KUBELET_HOSTNAME $KUBELE>
Main PID: 1416640 (code=exited, status=1/FAILURE)
Tasks: 0 (limit: 3297916)
Memory: 0B
CGroup: /system.slice/kubelet.service
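The status output above shows only that kubelet exited with status 1, not why, and the containerd lines in this report are cut off at the terminal width. A minimal sketch for pulling the full messages from the journal follows. The helper name `unit_log` and the default of 50 lines are assumptions made for illustration; `-u`, `--no-pager`, and `-n` are standard `journalctl` options.

```shell
# Hypothetical helper (not part of DeepOps): print the last N journal
# entries for a systemd unit without the pager, so long messages are
# not cut off at the terminal width.
unit_log() {
    ul_unit="$1"          # systemd unit name, e.g. kubelet or containerd
    ul_lines="${2:-50}"   # number of entries to show (assumed default: 50)
    journalctl -u "$ul_unit" --no-pager -n "$ul_lines"
}
```

For example, `unit_log kubelet 100` and `unit_log containerd 100` would print the most recent entries for each service, which should include the full text of the kubelet exit reason and of the "failed to reload cni configuration" message.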
Containerd service status:
$ systemctl status containerd
● containerd.service - containerd container runtime
Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2024-08-11 00:12:59 PDT; 1 day 5h ago
Docs: https://containerd.io
Main PID: 235710 (containerd)
Tasks: 50
Memory: 26.0M
CGroup: /system.slice/containerd.service
└─235710 /usr/local/bin/containerd
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.693829342-07:00" level=info msg=serving... address=/run/containerd/c>
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.694389943-07:00" level=info msg="containerd successfully booted in 0>
Aug 11 00:12:59 dgxg20.example.com systemd[1]: Started containerd container runtime.
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714280076-07:00" level=info msg="Start event monitor"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714326620-07:00" level=info msg="Start snapshots syncer"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714338153-07:00" level=info msg="Start cni network conf syncer"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714345598-07:00" level=info msg="Start streaming server"
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.482783309-07:00" level=error msg="failed to reload cni configuration>
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.483408806-07:00" level=error msg="failed to reload cni configuration>
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.483531403-07:00" level=error msg="failed to reload cni configuration>
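Since the suspicion is that CNI is not installed properly, one quick check is whether the CNI config and plugin directories on the node are populated. The sketch below assumes containerd's usual default paths, `/etc/cni/net.d` and `/opt/cni/bin`; this cluster's containerd config may point elsewhere, so both are overridable. The helper name `check_cni` is hypothetical.

```shell
# Hypothetical helper (not part of DeepOps): report whether the CNI config
# directory holds at least one network config file and whether the plugin
# directory holds at least one executable. Default paths are assumptions
# based on containerd's usual defaults.
check_cni() {
    cc_conf_dir="${1:-/etc/cni/net.d}"
    cc_bin_dir="${2:-/opt/cni/bin}"
    cc_status=0

    # Look for files with the common CNI config extensions.
    if find "$cc_conf_dir" -maxdepth 1 -type f \
            \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) \
            2>/dev/null | grep -q .; then
        echo "config: found in $cc_conf_dir"
    else
        echo "config: none found in $cc_conf_dir"
        cc_status=1
    fi

    # Look for at least one executable plugin binary.
    if find "$cc_bin_dir" -maxdepth 1 -type f -perm -u+x \
            2>/dev/null | grep -q .; then
        echo "plugins: found in $cc_bin_dir"
    else
        echo "plugins: none found in $cc_bin_dir"
        cc_status=1
    fi

    return "$cc_status"
}
```

Running `check_cni` with no arguments on the affected node checks the assumed default paths. A "none found" result for either directory would be consistent with the CNI reload error, though it does not by itself explain why the playbook left the directory empty.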