NVIDIA Driver Could not resolve Linux kernel version on CentOS 7.9 Kernel 5.4.

Open pohsien324 opened this issue 4 years ago • 1 comments

1. Quick Debug Checklist

  • [X] Are you running on an Ubuntu 18.04 node? => No, CentOS 7.9.
  • [X] Are you running Kubernetes v1.13+? => Yes, Kubernetes v1.19.9.
  • [X] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? => CRIO 1.19.1.
  • [X] Do you have i2c_core and ipmi_msghandler loaded on the nodes? => No
  • [X] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces) => Yes
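
For the kernel module item in the checklist, a minimal sketch of one way to check a node is shown below. The helper name `module_loaded` is hypothetical and not part of the GPU Operator. It assumes the standard `/proc/modules` layout, where each line starts with the module name followed by a space. Drivers compiled into the kernel do not appear in that file, so a "not loaded" result is not conclusive on its own.

```shell
# Hypothetical helper: succeed if module $1 appears in a /proc/modules-style
# listing ($2, defaulting to /proc/modules). Built-in drivers are not listed there.
module_loaded() {
  grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

# Report on the two modules named in the checklist.
for m in i2c_core ipmi_msghandler; do
  if module_loaded "$m"; then
    echo "$m: loaded"
  else
    echo "$m: not loaded (or built into the kernel)"
  fi
done
```

If a module is missing and is available for the running kernel, `modprobe <name>` as root is the usual way to load it.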

1. Issue or feature description

Cluster Information:

  • CentOS 7.9, Kernel Version: 5.4.124-1.el7.elrepo.x86_64
  • Kubernetes v1.19.9 with CRI-O 1.19.1
  • GPU Operator 1.6.2
  • NVIDIA Driver Image: nvcr.io/nvidia/driver:460.32.03-centos7

Last week I installed GPU Operator 1.6.2 on Kubernetes v1.19.9 (CentOS 7.9, kernel 3.10.0-1160.15.2.el7.x86_64) and everything worked. After upgrading the CentOS 7 kernel from 3.10.0 to 5.4, the NVIDIA driver Pod fails with the error below: the kernel version cannot be resolved, and the matching kernel packages cannot be found.

$ kubectl get pods -n gpu-operator-resources

NAME                                       READY   STATUS             RESTARTS   AGE
nvidia-container-toolkit-daemonset-4czx9   0/1     Init:0/1           0          37m
nvidia-driver-daemonset-cs2sr              0/1     CrashLoopBackOff   14         37m
$ kubectl logs nvidia-driver-daemonset-cs2sr -n gpu-operator-resources

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.124-1.el7.elrepo.x86_64

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Unable to open the file '/lib/modules/5.4.124-1.el7.elrepo.x86_64/proc/version' (No such file or directory).Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

I used ELRepo (elrepo.org) to upgrade the CentOS kernel. Does the NVIDIA driver image not support ELRepo kernels (or Linux kernel 5.x)?
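
To make the difference between the two kernels concrete, here is a small sketch that classifies a kernel release string by its vendor tag. The helper name `kernel_origin` is hypothetical. It only inspects the string and says nothing about how the driver container actually resolves the kernel version, which this thread does not confirm. The assumption being illustrated is that a release carrying the `elrepo` tag would have no matching packages in the stock CentOS repositories.

```shell
# Hypothetical helper: classify a kernel release string by its vendor tag.
# It only inspects the string; it does not query any package repository.
kernel_origin() {
  case "$1" in
    *.elrepo.*) echo "elrepo" ;;
    *.el7.*)    echo "centos" ;;
    *)          echo "unknown" ;;
  esac
}

kernel_origin "3.10.0-1160.15.2.el7.x86_64"   # prints: centos
kernel_origin "5.4.124-1.el7.elrepo.x86_64"   # prints: elrepo
```

On a node, `kernel_origin "$(uname -r)"` would classify the running kernel the same way.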

2. Steps to reproduce the issue

  1. Install Kubernetes v1.19.9 with CRI-O 1.19.1
  2. Upgrade the CentOS 7 kernel from 3.10.0-1160.15.2.el7.x86_64 to 5.4.124-1.el7.elrepo.x86_64 (using ELRepo)
  3. Deploy GPU Operator 1.6.2
$ helm install --wait --generate-name ./gpu-operator --set operator.defaultRuntime=crio --set toolkit.version=1.4.7-ubi8

3. Information to attach (optional if deemed irrelevant)

  1. Kubernetes pod status.
$ kubectl get pods --all-namespaces

default                  gpu-operator-1623131323-node-feature-discovery-master-6685shjp7   1/1     Running            0          28m
default                  gpu-operator-1623131323-node-feature-discovery-worker-cdvpj       1/1     Running            1          28m
default                  gpu-operator-1623131323-node-feature-discovery-worker-k9fpf       1/1     Running            1          28m
default                  gpu-operator-1623131323-node-feature-discovery-worker-kwdsb       1/1     Running            2          28m
default                  gpu-operator-1623131323-node-feature-discovery-worker-tldwn       1/1     Running            0          28m
default                  gpu-operator-65d474cc8-rtwdq                                      1/1     Running            0          28m
gpu-operator-resources   nvidia-container-toolkit-daemonset-4czx9                          0/1     Init:0/1           0          28m
gpu-operator-resources   nvidia-driver-daemonset-cs2sr                                     0/1     CrashLoopBackOff   12         28m
kube-system              cilium-42d9z                                                      1/1     Running            0          29m
kube-system              cilium-mhsdn                                                      1/1     Running            0          29m
kube-system              cilium-operator-694449c44b-n2pxm                                  1/1     Running            5          26h
kube-system              cilium-r6fkq                                                      1/1     Running            0          29m
kube-system              cilium-sft2q                                                      1/1     Running            0          29m
kube-system              coredns-7677f9bb54-dx4st                                          1/1     Running            2          25h
kube-system              coredns-7677f9bb54-r9h2p                                          1/1     Running            3          25h
kube-system              dns-autoscaler-5b7b5c9b6f-99t9s                                   1/1     Running            2          25h
kube-system              etcd-k8s-master1.k8s.lab                                          1/1     Running            2          26h
kube-system              kube-apiserver-k8s-master1.k8s.lab                                1/1     Running            2          26h
kube-system              kube-controller-manager-k8s-master1.k8s.lab                       1/1     Running            3          26h
kube-system              kube-proxy-9k5cv                                                  1/1     Running            2          26h
kube-system              kube-proxy-hw9rx                                                  1/1     Running            2          26h
kube-system              kube-proxy-pz5sq                                                  1/1     Running            2          26h
kube-system              kube-proxy-xq8k2                                                  1/1     Running            4          26h
kube-system              kube-scheduler-k8s-master1.k8s.lab                                1/1     Running            4          26h
kube-system              metrics-server-747c56cf5f-qv5vv                                   2/2     Running            4          25h
kube-system              nodelocaldns-gzrxj                                                1/1     Running            2          25h
kube-system              nodelocaldns-hqg47                                                1/1     Running            2          25h
kube-system              nodelocaldns-wc79d                                                1/1     Running            4          25h
kube-system              nodelocaldns-zvl9t                                                1/1     Running            2          25h
  2. NVIDIA Driver DaemonSet Log.
$ kubectl logs nvidia-driver-daemonset-cs2sr -n gpu-operator-resources

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.124-1.el7.elrepo.x86_64

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Unable to open the file '/lib/modules/5.4.124-1.el7.elrepo.x86_64/proc/version' (No such file or directory).Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

pohsien324 avatar Jun 08 '21 06:06 pohsien324

+1

daniel-hutao avatar Jun 16 '21 06:06 daniel-hutao

+1

ldd91 avatar May 23 '23 06:05 ldd91