gpu-operator
nvidia-driver-daemonset stuck in Init:CrashLoopBackOff (again)
I'm on Ubuntu 22.04, and I'm installing everything with the following Ansible playbook, including microk8s 1.24/edge.
---
- hosts: "{{ host | default('localhost')}}"
  become: yes
  become_method: sudo
  tasks:
    - name: installing CUDA from NVIDIA
      shell: |
        cd /tmp
        wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
        mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
        apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
        add-apt-repository -y "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
        apt-get update
        apt-get -y install cuda
    - name: installing microk8s
      snap:
        name: microk8s
        channel: 1.24/edge
        classic: yes
        state: present
    - shell: /snap/bin/microk8s start
    - shell: /snap/bin/microk8s enable gpu
Note that channel 1.24/edge is needed due to unrelated bugs in microk8s.
I have installed this on three machines. It works fine on two of them (each has two 2080 cards); I can kubectl apply the vector-add example on both.
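For reference, the vector-add test is just the stock CUDA sample pod; the manifest I apply looks roughly like this (the image is the usual cuda-vector-add sample, and the pod simply requests one GPU):

apiVersion: v1
kind: Pod
metadata:
  name: vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # standard CUDA vector-add sample image
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          # schedules the pod onto a node exposing a GPU via the device plugin
          nvidia.com/gpu: 1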
The third machine has a 3090 GPU and it is failing. It appears there is a problem with the k8s-driver-manager:
syslog:Jun 29 21:44:00 varuna microk8s.daemon-kubelite[77992]: E0629 21:44:00.085442 77992 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"k8s-driver-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=k8s-driver-manager pod=nvidia-driver-daemonset-f482n_gpu-operator-resources(706e7a7c-d8e0-40e4-b57b-f32813ae8a0d)\"" pod="gpu-operator-resources/nvidia-driver-daemonset-f482n" podUID=706e7a7c-d8e0-40e4-b57b-f32813ae8a0d
This is everything that's running on the microk8s installation:
varuna:log$ microk8s.kubectl get pods --all-namespaces
NAMESPACE                NAME                                                          READY   STATUS                  RESTARTS         AGE
default                  vector-add                                                    0/1     Pending                 0                6h14m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-d2776             1/1     Running                 37 (6m57s ago)   6h20m
kube-system              calico-node-mrdfg                                             1/1     Running                 1 (5h54m ago)    6h22m
kube-system              calico-kube-controllers-5755cd6ddb-kv9bx                      1/1     Running                 0                7m8s
gpu-operator-resources   gpu-operator-798c6ddc97-tgd8n                                 1/1     Running                 0                7m8s
gpu-operator-resources   gpu-operator-node-feature-discovery-master-6c65c99969-bxmp2   1/1     Running                 0                7m8s
kube-system              coredns-66bcf65bb8-lf69h                                      1/1     Running                 0                7m8s
kube-system              metrics-server-5f8f64cb86-thks7                               1/1     Running                 0                6m52s
kube-system              kubernetes-dashboard-765646474b-f2bk5                         1/1     Running                 0                5m43s
kube-system              dashboard-metrics-scraper-6b6f796c8d-5gfcs                    1/1     Running                 0                5m43s
gpu-operator-resources   nvidia-dcgm-exporter-bgv7b                                    0/1     Init:0/1                0                2m23s
gpu-operator-resources   nvidia-device-plugin-daemonset-6pfs7                          0/1     Init:0/1                0                2m23s
gpu-operator-resources   gpu-feature-discovery-r8fbh                                   0/1     Init:0/1                0                2m23s
gpu-operator-resources   nvidia-operator-validator-5zz4n                               0/1     Init:0/4                0                2m23s
gpu-operator-resources   nvidia-container-toolkit-daemonset-cpptw                      0/1     Init:0/1                0                2m23s
gpu-operator-resources   nvidia-driver-daemonset-f482n                                 0/1     Init:CrashLoopBackOff   75 (2m23s ago)   6h19m
varuna:log$
It appears something is trying to "unload the driver". The other nodes are running the same driver, the latest from the NVIDIA CUDA repo (515.48.07).
These are the logs from the nvidia-driver-daemonset containers:
varuna:log$ microk8s.kubectl logs pod/nvidia-driver-daemonset-f482n -n gpu-operator-resources --all-containers
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
nvidia driver module is already loaded with refcount 236
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/varuna labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-5zz4n condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Unloading NVIDIA driver kernel modules...
nvidia_drm 69632 5
drm_kms_helper 307200 1 nvidia_drm
nvidia_uvm 1282048 0
nvidia_modeset 1142784 7 nvidia_drm
nvidia 40800256 236 nvidia_uvm,nvidia_modeset
drm 606208 9 drm_kms_helper,nvidia,nvidia_drm
Could not unload NVIDIA driver kernel modules, driver is in use
Unable to cleanup driver modules, attempting again with node drain...
Draining node varuna...
node/varuna cordoned
error: unable to drain node "varuna" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-5f8f64cb86-thks7, kube-system/kubernetes-dashboard-765646474b-f2bk5, kube-system/dashboard-metrics-scraper-6b6f796c8d-5gfcs, continuing command...
There are pending nodes to be drained:
varuna
cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-5f8f64cb86-thks7, kube-system/kubernetes-dashboard-765646474b-f2bk5, kube-system/dashboard-metrics-scraper-6b6f796c8d-5gfcs
Uncordoning node varuna...
node/varuna uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/varuna unlabeled
varuna:log$
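For reference, getting past that particular drain error by hand would look something like the following; the flag is the one the error message itself suggests (whether the driver manager ought to pass it is a separate question):

# drain manually, allowing pods that only use emptyDir volumes to be evicted
microk8s kubectl drain varuna --ignore-daemonsets --delete-emptydir-data
# once the driver daemonset has done its work, put the node back into rotation
microk8s kubectl uncordon varuna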
The difference appears to be that the machine with the 3090 actually tries to run an nvidia-driver-daemonset (which tries to unload the driver and fails), while the machines with the 2080 cards don't. Why they behave differently is beyond me: both card models have been out for a while, and I have the latest kernel driver installed on all three machines, so there shouldn't be any need to unload/reload the driver on any of them.
All machines run GPU containers fine under docker and podman, so the driver is perfectly functional.
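(By "fine" I mean that a smoke test along these lines succeeds on all three machines; the exact CUDA image tag shouldn't matter much:)

# run nvidia-smi inside a CUDA container via the NVIDIA container runtime
docker run --rm --gpus all nvidia/cuda:11.7.0-base-ubuntu22.04 nvidia-smi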
I have tried booting the machine in text mode (no processes using the GPU according to nvidia-smi), and the nvidia-driver-daemonset fails in the same way.
varuna:~$ nvidia-smi
Wed Jun 29 22:35:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P8    15W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
varuna:~$
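In case it helps anyone debugging the same thing, the refcount the driver manager prints can be cross-checked on the host with something like this (the third lsmod column is that use count):

# module use counts; the "nvidia" line shows the refcount the k8s-driver-manager reports
lsmod | grep -E '^nvidia'
# any processes still holding the GPU device nodes open (prints nothing if unused)
sudo fuser -v /dev/nvidia*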
So, how can I fix this? Is there some way to tell the gpu-operator not to even attempt to run the nvidia-driver-daemonset? Any other suggestions?
@tmbdev Please install with the driver container disabled, since you seem to have drivers pre-installed on the node already: --set driver.enabled=false. Or use the latest version of the operator, v1.11.0, where the driver container detects this and stays in the init phase.
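With a plain Helm install (outside the microk8s addon) that would look roughly like:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# deploy the operator but skip the driver container, since the driver is already on the host
helm install --wait gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false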
Also, note that you don't have to pre-install the drivers in the first place; the operator takes care of that, so this task isn't needed:
- name: installing CUDA from NVIDIA
  shell: |
    cd /tmp
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
    mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
    add-apt-repository -y "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
    apt-get update
    apt-get -y install cuda
Ah, I see, just adding the option works: microk8s enable gpu --set driver.enabled=false
This leaves the mystery of why this was working on two machines and failing on one.
There is no choice here: Ubuntu desktop machines necessarily have the NVIDIA drivers installed; the only question is whether they come from the Ubuntu repo or the NVIDIA repo.
Microk8s is frequently used on desktop machines, alongside docker and podman installations and other GPU software; that's another reason a driver needs to be preinstalled.
Got it. It seems like this option needs to be set by default for microk8s installs. This problem will not happen with v1.11.0 of the operator, as it will not try to overwrite drivers that are already pre-installed.
What helped me was to identify the pods mentioned in this error (the same one in your logs):
cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-5f8f64cb86-thks7, kube-system/kubernetes-dashboard-765646474b-f2bk5, kube-system/dashboard-metrics-scraper-6b6f796c8d-5gfcs
- Cordon the node
- Manually remove the pods mentioned in the error
- Restart the nvidia-driver-daemonset pod (not sure if this is necessary)
Afterwards, when the pod restarted, it drained the node successfully, installed the drivers, and automatically uncordoned the node.
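Concretely, the commands were roughly the following (the pod names are the ones from the error above and will differ on your cluster; use microk8s.kubectl if you are on microk8s):

# keep new pods off the node while cleaning up
kubectl cordon varuna
# delete the local-storage pods that block the drain; their Deployments recreate them
kubectl delete pod -n kube-system \
  metrics-server-5f8f64cb86-thks7 \
  kubernetes-dashboard-765646474b-f2bk5 \
  dashboard-metrics-scraper-6b6f796c8d-5gfcs
# optionally restart the driver pod; the DaemonSet recreates it immediately
kubectl delete pod -n gpu-operator-resources nvidia-driver-daemonset-f482n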