gpu-operator
NVIDIA Driver Could not resolve Linux kernel version on CentOS 7.9 Kernel 5.4.
1. Quick Debug Checklist
- [X] Are you running on an Ubuntu 18.04 node? => No, CentOS 7.9.
- [X] Are you running Kubernetes v1.13+? => Yes, Kubernetes v1.19.9.
- [X] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? => CRIO 1.19.1.
- [X] Do you have i2c_core and ipmi_msghandler loaded on the nodes? => No
- [X] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces) => Yes
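For the module item in the checklist, a minimal sketch of how the check could be run on each node. It assumes a /proc/modules-style listing (module name in the first column); `module_loaded` and the `MODULES_FILE` override are hypothetical names used only for illustration. A module compiled directly into the kernel does not appear in this listing, so "not loaded" here is not conclusive.

```shell
#!/bin/sh
# Report whether the modules from the checklist appear in a
# /proc/modules-style listing (first column = module name).
# MODULES_FILE is a hypothetical override, mainly useful for testing.
modules_file="${MODULES_FILE:-/proc/modules}"

module_loaded() {
    # $1 = module name, $2 = listing file; exit status 0 when found
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$2" 2>/dev/null
}

for mod in i2c_core ipmi_msghandler; do
    if module_loaded "$mod" "$modules_file"; then
        echo "$mod: loaded"
    else
        echo "$mod: not listed (could be built in, or try: modprobe $mod)"
    fi
done
```

Run it as root or a regular user on the GPU node; it only reads the listing and changes nothing.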
1. Issue or feature description
Cluster Information:
- CentOS 7.9, Kernel Version: 5.4.124-1.el7.elrepo.x86_64
- Kubernetes v1.19.9 with CRI-O 1.19.1
- GPU Operator 1.6.2
- NVIDIA Driver Image: nvcr.io/nvidia/driver:460.32.03-centos7
Last week, I installed GPU Operator 1.6.2 on Kubernetes v1.19.9 (CentOS 7.9, kernel 3.10.0-1160.15.2.el7.x86_64), and everything worked fine. After upgrading the CentOS 7 kernel from 3.10.0 to 5.4, however, the NVIDIA driver pod fails with the error below: the kernel version cannot be resolved, and the matching kernel packages cannot be found.
$ kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-container-toolkit-daemonset-4czx9 0/1 Init:0/1 0 37m
nvidia-driver-daemonset-cs2sr 0/1 CrashLoopBackOff 14 37m
$ kubectl logs nvidia-driver-daemonset-cs2sr -n gpu-operator-resources
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.124-1.el7.elrepo.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Unable to open the file '/lib/modules/5.4.124-1.el7.elrepo.x86_64/proc/version' (No such file or directory).Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
I used ELRepo.org to upgrade the CentOS kernel. Does the NVIDIA driver image not support ELRepo kernels (or Linux kernel 5.x in general)?
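One way to narrow this down, sketched under assumptions rather than taken from the driver image itself: ELRepo kernels carry ".elrepo." in their release string and ship as kernel-lt-* / kernel-ml-* packages, while stock CentOS kernels ship as kernel-*. If the driver container only searches the repositories it has enabled for packages matching the running release, an ELRepo release would have nothing to match. The function name `kernel_flavor` is hypothetical.

```shell
#!/bin/sh
# Classify a kernel release string and look for matching devel packages.
# Assumption: ELRepo builds contain ".elrepo." in the release string.
kernel_flavor() {
    case "$1" in
        *.elrepo.*) echo "elrepo" ;;
        *)          echo "stock" ;;
    esac
}

release="${1:-$(uname -r)}"
echo "kernel release: $release ($(kernel_flavor "$release"))"

# If yum is available, list kernel*-devel packages whose version matches
# the running kernel in the currently enabled repositories.
if command -v yum >/dev/null 2>&1; then
    version="${release%.*}"   # drop the trailing architecture, e.g. ".x86_64"
    yum -q list --showduplicates 'kernel*-devel' 2>/dev/null | grep -F "$version" \
        || echo "no kernel*-devel package matching $version in enabled repos"
fi
```

Running this on the host and again inside the driver container (if a shell can be started there) would show whether the two see the same repositories.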
2. Steps to reproduce the issue
- Install Kubernetes v1.19.9 with CRI-O 1.19.1
- Upgrade the CentOS 7 kernel from 3.10.0-1160.15.2.el7.x86_64 to 5.4.124-1.el7.elrepo.x86_64 (using ELRepo)
- Deploy GPU Operator 1.6.2
$ helm install --wait --generate-name ./gpu-operator --set operator.defaultRuntime=crio --set toolkit.version=1.4.7-ubi8
3. Information to attach (optional if deemed irrelevant)
- Kubernetes pods status.
$ kubectl get pods --all-namespaces
default gpu-operator-1623131323-node-feature-discovery-master-6685shjp7 1/1 Running 0 28m
default gpu-operator-1623131323-node-feature-discovery-worker-cdvpj 1/1 Running 1 28m
default gpu-operator-1623131323-node-feature-discovery-worker-k9fpf 1/1 Running 1 28m
default gpu-operator-1623131323-node-feature-discovery-worker-kwdsb 1/1 Running 2 28m
default gpu-operator-1623131323-node-feature-discovery-worker-tldwn 1/1 Running 0 28m
default gpu-operator-65d474cc8-rtwdq 1/1 Running 0 28m
gpu-operator-resources nvidia-container-toolkit-daemonset-4czx9 0/1 Init:0/1 0 28m
gpu-operator-resources nvidia-driver-daemonset-cs2sr 0/1 CrashLoopBackOff 12 28m
kube-system cilium-42d9z 1/1 Running 0 29m
kube-system cilium-mhsdn 1/1 Running 0 29m
kube-system cilium-operator-694449c44b-n2pxm 1/1 Running 5 26h
kube-system cilium-r6fkq 1/1 Running 0 29m
kube-system cilium-sft2q 1/1 Running 0 29m
kube-system coredns-7677f9bb54-dx4st 1/1 Running 2 25h
kube-system coredns-7677f9bb54-r9h2p 1/1 Running 3 25h
kube-system dns-autoscaler-5b7b5c9b6f-99t9s 1/1 Running 2 25h
kube-system etcd-k8s-master1.k8s.lab 1/1 Running 2 26h
kube-system kube-apiserver-k8s-master1.k8s.lab 1/1 Running 2 26h
kube-system kube-controller-manager-k8s-master1.k8s.lab 1/1 Running 3 26h
kube-system kube-proxy-9k5cv 1/1 Running 2 26h
kube-system kube-proxy-hw9rx 1/1 Running 2 26h
kube-system kube-proxy-pz5sq 1/1 Running 2 26h
kube-system kube-proxy-xq8k2 1/1 Running 4 26h
kube-system kube-scheduler-k8s-master1.k8s.lab 1/1 Running 4 26h
kube-system metrics-server-747c56cf5f-qv5vv 2/2 Running 4 25h
kube-system nodelocaldns-gzrxj 1/1 Running 2 25h
kube-system nodelocaldns-hqg47 1/1 Running 2 25h
kube-system nodelocaldns-wc79d 1/1 Running 4 25h
kube-system nodelocaldns-zvl9t 1/1 Running 2 25h
- NVIDIA Driver DaemonSet Log.
$ kubectl logs nvidia-driver-daemonset-cs2sr -n gpu-operator-resources
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.124-1.el7.elrepo.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Unable to open the file '/lib/modules/5.4.124-1.el7.elrepo.x86_64/proc/version' (No such file or directory).Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
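To pick the failing pods out of a long `kubectl get pods --all-namespaces` listing like the one attached above, a small filter can help. This is a sketch that assumes the six-column, header-less layout shown in this report (namespace, name, ready, status, restarts, age); `not_ready_pods` is a hypothetical helper name.

```shell
#!/bin/sh
# Print pods whose READY column (e.g. "0/1") shows fewer ready containers
# than expected. Reads header-less "kubectl get pods --all-namespaces"
# output on stdin.
not_ready_pods() {
    awk '{ split($3, r, "/"); if (r[1] != r[2]) print $1 "/" $2, $3, $4 }'
}

# Only query the cluster when kubectl is actually installed.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get pods --all-namespaces --no-headers 2>/dev/null | not_ready_pods
fi
```

Against the listing in this report, it would be expected to leave only the two pods in gpu-operator-resources.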
+1
+1