nvidia-driver-daemonset always fails on Ubuntu 20.04.2
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [ ] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
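For reference, the last two checklist items can be verified with standard tooling on the node and against the cluster:

```bash
# Check that the kernel modules the operator expects are loaded on the node
# (on newer kernels i2c_core may be built in rather than listed by lsmod)
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Confirm the ClusterPolicy CRD is applied and inspect the deployed policy
kubectl describe clusterpolicies --all-namespaces
```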
I'm running:
- Ubuntu 20.04.2
- MicroK8s v1.21.1
- containerd
No GPU driver is installed on the host machine; the GPU is a GeForce GTX 1650.
1. Issue or feature description

Enable `gpu-operator` in MicroK8s:

```
microk8s.enable gpu
```

The pod `nvidia-driver-daemonset` always fails:

```
kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-7f9c69c78c-hphlc 1/1 Running 1 9h
kube-system calico-node-x76zp 1/1 Running 1 9h
kube-system calico-kube-controllers-f7868dd95-tqjnw 1/1 Running 1 9h
default gpu-operator-node-feature-discovery-master-867c4f7bfb-5wpgk 1/1 Running 0 6m46s
default gpu-operator-node-feature-discovery-worker-msmv2 1/1 Running 0 6m46s
gpu-operator-resources nvidia-operator-validator-rh7h7 0/1 Init:0/4 0 6m8s
gpu-operator-resources nvidia-device-plugin-daemonset-8kjxn 0/1 Init:0/1 0 6m8s
gpu-operator-resources nvidia-dcgm-exporter-kgbq5 0/1 Init:0/1 0 6m8s
gpu-operator-resources gpu-feature-discovery-wvvr2 0/1 Init:0/1 0 6m8s
default gpu-operator-7db468cfdf-4sv48 1/1 Running 0 6m46s
gpu-operator-resources nvidia-container-toolkit-daemonset-7s686 0/1 Init:0/1 0 6m8s
gpu-operator-resources nvidia-driver-daemonset-ck684 0/1 CrashLoopBackOff 5 6m9s
```
I checked its logs:

```
kubectl logs nvidia-driver-daemonset-ck684 -n gpu-operator-resources
Creating directory NVIDIA-Linux-x86_64-460.73.01
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.73.01...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that an NVIDIA kernel module matching this driver version is installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 5.8.0-55-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
E: Failed to fetch https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/ce4d38aa740e318d2eae04cba08f1322017d162183c8f61f84391bf88020a534 404 Not Found [IP: 180.101.196.129 443]
E: Some index files failed to download. They have been ignored, or old ones used instead.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
```
It seems an error occurred while fetching the package index:

```
E: Failed to fetch https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/ce4d38aa740e318d2eae04cba08f1322017d162183c8f61f84391bf88020a534 404 Not Found [IP: 180.101.196.129 443]
E: Some index files failed to download. They have been ignored, or old ones used instead.
```
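One way to confirm the 404 comes from the mirror rather than from anything inside the driver container is to request the same by-hash object from the host, against both the `.cn` mirror the pod was redirected to and the primary `.com` endpoint. This is purely a diagnostic sketch using the exact URL from the log:

```bash
# Hash and path taken verbatim from the apt error above
HASH=ce4d38aa740e318d2eae04cba08f1322017d162183c8f61f84391bf88020a534
BASE=compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256

# Mirror the driver pod was redirected to (returned 404 in the log)
curl -sI "https://developer.download.nvidia.cn/${BASE}/${HASH}" | head -n 1

# Primary endpoint, for comparison
curl -sI "https://developer.download.nvidia.com/${BASE}/${HASH}" | head -n 1
```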
I checked the Helm release; the latest chart, v1.7.0, is installed:

```
microk8s.helm3 ls
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/2262/credentials/client.config
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator default 1 2021-06-22 06:38:42.691611216 +0800 CST deployed gpu-operator-v1.7.0 v1.7.0
```
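To see exactly which values the MicroK8s addon passed to that release (for example, whether `driver.enabled` was overridden), Helm can dump them; the release name comes from the `helm ls` output above:

```bash
# User-supplied values only
microk8s.helm3 get values gpu-operator

# All computed values, including chart defaults
microk8s.helm3 get values gpu-operator --all
```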
2. Steps to reproduce the issue

On Ubuntu 20.04.2 with an NVIDIA GPU:

```
sudo snap install microk8s --channel 1.21/stable --classic
microk8s enable gpu
```
An update: the pod `nvidia-driver-daemonset` is not created after I install the GPU driver on the host machine, and all related pods then work well (a sketch of that workaround follows the pod listing below):

```
kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator-resources gpu-feature-discovery-8jxrm 1/1 Running 2 12h
gpu-operator-resources nvidia-dcgm-exporter-886qt 1/1 Running 2 12h
gpu-operator-resources nvidia-device-plugin-daemonset-qxk7z 1/1 Running 2 12h
gpu-operator-resources nvidia-container-toolkit-daemonset-9fxjl 1/1 Running 2 12h
kube-system coredns-7f9c69c78c-qxw9j 1/1 Running 11 24h
default gpu-operator-7db468cfdf-ghrfd 1/1 Running 2 12h
default gpu-operator-node-feature-discovery-master-867c4f7bfb-9mv2m 1/1 Running 2 12h
kube-system calico-kube-controllers-8695b994-jlhfd 1/1 Running 3 13h
gpu-operator-resources nvidia-cuda-validator-lv2t7 0/1 Completed 0 80m
gpu-operator-resources nvidia-device-plugin-validator-svl2c 0/1 Completed 0 80m
kube-system calico-node-h5rwh 1/1 Running 3 13h
gpu-operator-resources nvidia-operator-validator-khdjk 1/1 Running 2 12h
default gpu-operator-node-feature-discovery-worker-rhcgj 1/1 Running 3 12h
kube-system hostpath-provisioner-5c65fbdb4f-66dpn 1/1 Running 0 79m
kube-system metrics-server-8bbfb4bdb-88j8s 1/1 Running 0 79m
```
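For completeness, here is a sketch of the host-driver workaround described in the update above. The report does not say how the driver was installed, so `ubuntu-drivers autoinstall` is an assumption; any supported host driver install should behave the same:

```bash
# Install an NVIDIA driver on the host (assumed method)
sudo apt-get update
sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot

# After the reboot, re-enable the addon; with a host driver present the operator
# does not deploy nvidia-driver-daemonset
microk8s disable gpu
microk8s enable gpu
```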
I checked the script that enables `gpu-operator` in MicroK8s: it sets the argument `driver.enabled=true` only when no GPU driver is already installed on the host machine.
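The relevant logic can be inspected directly in the snap. The path below is my assumption for the 1.21 snap layout, where the built-in addon scripts live under `actions/`:

```bash
# Show how the gpu addon decides whether the operator should deploy its own driver
# (path assumed; adjust if the addon script lives elsewhere in your snap revision)
grep -n -B2 -A2 "driver.enabled" /snap/microk8s/current/actions/enable.gpu.sh
```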
The current issue focuses on the case where the GPU driver is installed by `gpu-operator`: there, the pod `nvidia-driver-daemonset` always fails on Ubuntu 20.04.2.
Interesting, the error is not related to the kernel version. The base image used to build this container is `nvidia/cuda:11.3.0-base-ubuntu20.04`, and the first package-related command it runs is `apt-get update`. The same setup works fine with AWS Ubuntu 20.04 kernels (e.g. 5.4.0-1048-aws). I will check with the CUDA team here to verify whether this has been seen before.
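A quick way to reproduce that first step outside the cluster is to run the same `apt-get update` in that base image; this assumes Docker is available on the host and that the tag is still pullable and still ships the CUDA apt source:

```bash
# Run the same package-cache update the driver container performs first and
# show the tail of the output, where the failing index fetch would appear
docker run --rm nvidia/cuda:11.3.0-base-ubuntu20.04 \
  bash -c "apt-get update 2>&1 | tail -n 20"
```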
https://github.com/ubuntu/microk8s/issues/2763#issuecomment-999778587
Installing microk8s v1.22 worked. It looks like an issue with microk8s: https://github.com/canonical/microk8s/issues/2634
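A sketch of that workaround, assuming an in-place channel switch of the existing snap (a fresh install of the 1.22 channel works the same way):

```bash
# Move the existing MicroK8s snap to the 1.22 channel, then re-enable the addon
sudo snap refresh microk8s --channel=1.22/stable
microk8s enable gpu
```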