microk8s enable nvidia is not complete
Summary
I have a 3-node cluster running microk8s 1.29.4, with an NVIDIA RTX 3060 in the gpu01 node.
$ microk8s.kubectl get nodes
NAME    STATUS   ROLES    AGE   VERSION
gpu01   Ready    <none>   47m   v1.29.4
mm321   Ready    <none>   57m   v1.29.4
mm322   Ready    <none>   48m   v1.29.4
On executing microk8s enable nvidia on the master node (mm321), some of the pods related to the GPU operator are stuck in the Init state:
mm321:~$ microk8s.kubectl get po -A
NAMESPACE                NAME                                                          READY   STATUS                  RESTARTS      AGE
gpu-operator-resources   gpu-feature-discovery-9kgs4                                   0/1     Init:0/1                0             11m
gpu-operator-resources   gpu-operator-999cc8dcc-qjkf2                                  1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-gc-7cc7ccfff8-rhsjx       1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-master-d8597d549-s2dpl    1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-czp56              1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-gth9b              1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-vz46v              1/1     Running                 0             12m
gpu-operator-resources   nvidia-container-toolkit-daemonset-44fxd                      0/1     Init:CrashLoopBackOff   7 (58s ago)   11m
gpu-operator-resources   nvidia-dcgm-exporter-7cc72                                    0/1     Init:0/1                0             11m
gpu-operator-resources   nvidia-device-plugin-daemonset-jjvnr                          0/1     Init:0/1                0             11m
gpu-operator-resources   nvidia-operator-validator-mcxj2                               0/1     Init:0/4                0             11m
ingress                  nginx-ingress-microk8s-controller-7q8jn                       1/1     Running                 0             49m
ingress                  nginx-ingress-microk8s-controller-pj44d                       1/1     Running                 0             54m
ingress                  nginx-ingress-microk8s-controller-wm9b9                       1/1     Running                 0             48m
kube-system              calico-kube-controllers-77bd7c5b-ksrq6                        1/1     Running                 0             58m
kube-system              calico-node-b8fql                                             1/1     Running                 0             48m
kube-system              calico-node-fm9qz                                             1/1     Running                 0             49m
kube-system              calico-node-j82ml                                             1/1     Running                 0             49m
kube-system              coredns-864597b5fd-8tzxm                                      1/1     Running                 0             58m
kube-system              hostpath-provisioner-756cd956bc-t78f9                         1/1     Running                 1 (49m ago)   54m
metallb-system           controller-5f7bb57799-gs4vm                                   1/1     Running                 0             54m
metallb-system           speaker-5g865                                                 1/1     Running                 0             49m
metallb-system           speaker-ld7cc                                                 1/1     Running                 0             54m
metallb-system           speaker-sv2nd                                                 1/1     Running                 0             48m
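The toolkit daemonset is the only pod that is actively failing rather than just waiting, so its init container is the natural place to look. A minimal sketch of how I pulled its events and logs, using the pod name from the listing above (the name suffix is specific to this cluster):

# Show init-container state and events for the crashing toolkit pod
$ microk8s.kubectl -n gpu-operator-resources describe pod nvidia-container-toolkit-daemonset-44fxd

# Dump logs from every container in the pod, init containers included
$ microk8s.kubectl -n gpu-operator-resources logs nvidia-container-toolkit-daemonset-44fxd --all-containers=true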
What Should Happen Instead?
Pods should not be stuck in the Init state.
Reproduction Steps
- Install the microk8s 1.29.4 snap on all the nodes
- Add nodes mm322 and gpu01 to mm321
- On mm321, run microk8s enable nvidia (a command sketch follows below)
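For completeness, a rough sketch of the commands behind these steps, assuming the 1.29/stable snap channel; the actual join token printed by add-node is omitted:

# On every node
$ sudo snap install microk8s --classic --channel=1.29/stable

# On mm321: print a join command for each worker
$ microk8s add-node

# On mm322 and gpu01: run the join command printed above
$ microk8s join <mm321-ip>:25000/<token>

# On mm321: enable the addon
$ microk8s enable nvidia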
Introspection Report
$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6809/var/kubernetes/backend/localnode.yaml': No such file or directory
WARNING: Maximum number of inotify user watches is less than the recommended value of 1048576.
Increase the limit with:
echo fs.inotify.max_user_watches=1048576 | sudo tee -a /etc/sysctl.conf
sudo sysctl --system
Building the report tarball
Report tarball is at /var/snap/microk8s/6809/inspection-report-20240630_000040.tar.gz
inspection-report-20240630_000040.tar.gz
Can you suggest a fix?
No
Are you interested in contributing with a fix?
No
Workaround: I was using the 545 driver on the gpu01 device. After downgrading to the 535 driver, the issue seems to be resolved.
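Roughly, the downgrade looks like this, assuming gpu01 runs Ubuntu with the packaged driver (the package name and the disable/enable cycle are assumptions; they may differ on other setups):

# On gpu01: check which driver branch is currently loaded
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader

# On gpu01: switch from the 545 branch to the 535 branch and reboot
$ sudo apt install nvidia-driver-535
$ sudo reboot

# On mm321: re-run the addon once gpu01 is Ready again
$ microk8s disable nvidia
$ microk8s enable nvidia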