Ansible playbook failing to add RHEL 8 DGX Node in K8s cluster

Open subasathees opened this issue 6 months ago • 0 comments

Issue: The Ansible playbook fails to add a RHEL 8 DGX node to the K8s cluster, and the kubelet service keeps crashing.

Issue Details: One of the RHEL 7 DGX nodes in the cluster failed due to a hardware issue. We replaced the failed physical DGX node and installed RHEL 8 on the replacement, but when we try to add the newly rebuilt node, the Ansible playbook fails with a "failed to reload cni configuration" error. We suspect that CNI is not installed properly.

DeepOps Version: release-22.04
Kubernetes Version: K8s v23.5

Error: containerd logs on the node:

Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.694389943-07:00" level=info msg="containerd successfully boote>
Aug 11 00:12:59 dgxg20.example.com systemd[1]: Started containerd container runtime.
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714280076-07:00" level=info msg="Start event monitor"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714326620-07:00" level=info msg="Start snapshots syncer"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714338153-07:00" level=info msg="Start cni network conf syncer"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714345598-07:00" level=info msg="Start streaming server"
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.482783309-07:00" level=error msg="failed to reload cni configu>
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.483408806-07:00" level=error msg="failed to reload cni configu>
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.483531403-07:00" level=error msg="failed to reload cni configu

Kubelet service not starting:

$ systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2024-08-12 06:01:56 PDT; 5s ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
  Process: 1416640 ExecStart=/usr/local/bin/kubelet $KUBE_LOGTOSTDERR $KUBE_LOG_LEVEL $KUBELET_API_SERVER $KUBELET_ADDRESS $KUBELET_PORT $KUBELET_HOSTNAME $KUBELE>
Main PID: 1416640 (code=exited, status=1/FAILURE)
    Tasks: 0 (limit: 3297916)
   Memory: 0B
   CGroup: /system.slice/kubelet.service

containerd service status:

$ systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2024-08-11 00:12:59 PDT; 1 day 5h ago
     Docs: https://containerd.io
Main PID: 235710 (containerd)
    Tasks: 50
   Memory: 26.0M
   CGroup: /system.slice/containerd.service
           └─235710 /usr/local/bin/containerd
 
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.693829342-07:00" level=info msg=serving... address=/run/containerd/c>
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.694389943-07:00" level=info msg="containerd successfully booted in 0>
Aug 11 00:12:59 dgxg20.example.com systemd[1]: Started containerd container runtime.
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714280076-07:00" level=info msg="Start event monitor"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714326620-07:00" level=info msg="Start snapshots syncer"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714338153-07:00" level=info msg="Start cni network conf syncer"
Aug 11 00:12:59 dgxg20.example.com containerd[235710]: time="2024-08-11T00:12:59.714345598-07:00" level=info msg="Start streaming server"
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.482783309-07:00" level=error msg="failed to reload cni configuration>
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.483408806-07:00" level=error msg="failed to reload cni configuration>
Aug 11 00:16:54 dgxg20.example.com containerd[235710]: time="2024-08-11T00:16:54.483531403-07:00" level=error msg="failed to reload cni configuration>
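The "failed to reload cni configuration" lines above are cut off by the pager, so the actual reason is not visible. Since the suspicion is that CNI is not installed properly, a minimal sketch of a check for the CNI directories follows. The paths /etc/cni/net.d and /opt/cni/bin are the conventional defaults and are an assumption here; this node may be configured differently. The helper name check_dir and the CNI_CONF_DIR / CNI_BIN_DIR variables are hypothetical, introduced only for this example.

```shell
#!/bin/sh
# Report whether a directory is missing, empty, or populated.
check_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "MISSING $dir"
    elif [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "EMPTY $dir"
    else
        echo "OK $dir"
    fi
}

# Assumed default locations; override via the (hypothetical) variables
# if containerd on this node points somewhere else.
check_dir "${CNI_CONF_DIR:-/etc/cni/net.d}"
check_dir "${CNI_BIN_DIR:-/opt/cni/bin}"

# To see the full, untruncated containerd messages on the node:
# journalctl -u containerd --no-pager | grep "failed to reload cni configuration"
```

A MISSING or EMPTY result for the config directory would be consistent with the suspicion that the CNI plugin was never deployed on the rebuilt node, but it does not by itself confirm the cause.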

subasathees, Aug 12 '24 13:08