CentOS 7: nvidia-driver pod "Could not resolve Linux kernel version"
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node? No, CentOS 7.8
- [x] Are you running Kubernetes v1.13+? v1.18
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Docker 20.10.3
- [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check below)
- [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
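A quick way to verify those two modules on a node (a generic check; empty output can also mean the module is built into the kernel rather than loaded as a module):
# run on the GPU node
lsmod | grep -E '^(i2c_core|ipmi_msghandler)'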
2. Issue or feature description
I get an error while the nvidia-driver pod tries to install the driver on CentOS 7. This is the log:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 3.10.0-862.el7.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Unable to open the file '/lib/modules/3.10.0-862.el7.x86_64/proc/version' (No such file or directory).Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
I see the same error in #97, but disabling nouveau as suggested there did not resolve it. I'm using gpu-operator v1.5.2. Please help me resolve this error, thanks.
Having the same problem. It also happens on gpu-operator v1.6, please help 🙏
To better understand what script is running, could you tell us which image the driver Pod is running?
kubectl get pods -n gpu-operator-resources -o=jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{", "}{end}{end}' | grep nvidia-driver-daemonset
I guess it should be executing this script: https://gitlab.com/nvidia/container-images/driver/-/blob/master/centos7/nvidia-driver
echo "Resolving Linux kernel version..."
if [ -z "${version}" ]; then
echo "Could not resolve Linux kernel version" >&2
return 1
fi
but the error message doesn't say so:
Resolving Linux kernel version...
Unable to open the file '/lib/modules/3.10.0-862.el7.x86_64/proc/version' (No such file or directory).
Could not resolve Linux kernel version
Hey @kpouget, the driver's image is nvidia-driver-daemonset-5tzbm: nvcr.io/nvidia/driver:460.32.03-centos7
Hi, I have the exact same issue:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 4.19.95-1.bplatform.el7.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Unable to open the file '/lib/modules/4.19.95-1.bplatform.el7.x86_64/proc/version' (No such file or directory).Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
The driver version I am using is exactly the same as the one @sahare92 is using:
nvidia-driver-daemonset-5b4mp: nvcr.io/nvidia/driver:460.32.03-centos7,
Would greatly appreciate help.
@hassanshabbirahmed @vietkute02 We will debug the issue with CentOS 7. Meanwhile, can you edit the driver daemonset to change the image to nvcr.io/nvidia/driver:450.80.02-rhel7.9 and verify whether this resolves it?
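For reference, one way to do that edit (a sketch assuming the default gpu-operator-resources namespace and that the daemonset's container is named nvidia-driver-ctr, which may differ between operator versions):
# point the driver daemonset at the suggested rhel7.9 image
kubectl -n gpu-operator-resources set image daemonset/nvidia-driver-daemonset \
  nvidia-driver-ctr=nvcr.io/nvidia/driver:450.80.02-rhel7.9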
Same issue, also happening on CentOS 8:
./helm install --wait --generate-name nvidia/gpu-operator --debug --set driver.version="450.80.02"
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 4.18.0-240.el8.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Unable to open the file '/lib/modules/4.18.0-240.el8.x86_64/proc/version' (No such file or directory).Could not resolve Linux kernel version. You likely have a mismatch between your running kernel and the kernel-headers on the repo. Please upgrade your Linux kernel to at least 4.18.0-240.el8.x86_64.
On my server, I run kernel 4.18.0 offline, but it seems that the script checks the kernel version against the online repo, which makes the version numbers mismatch:
local version=$(dnf -q list available --showduplicates kernel-headers |
awk -v arch=$(uname -m) 'NR>1 {print $2"."arch}' | tac | grep -E -m1 "^${KERNEL_VERSION/latest/.*}")
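You can reproduce that check by hand inside the driver container (a sketch using the same dnf query as above); if it prints nothing, the installer fails exactly like this:
# list the kernel-headers versions the repo offers and look for the running kernel
dnf -q list available --showduplicates kernel-headers \
  | awk -v arch=$(uname -m) 'NR>1 {print $2"."arch}' \
  | grep -F "$(uname -r)"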
My cluster runs on CentOS 7.6 with an upgraded kernel, 4.19.12-1.el7:
# rpm -qa | grep kernel-ml
kernel-ml-4.19.12-1.el7.elrepo.x86_64
Replacing kernel with kernel-ml in the nvidia-driver script and rebuilding the image worked for me; using the modified image nvcr.io/nvidia/mldriver:460.32.03-centos7, I got the nvidia-driver-daemonset working.
# docker build -t nvcr.io/nvidia/mldriver:460.32.03-centos7 .
# cat Dockerfile
FROM nvcr.io/nvidia/driver:460.32.03-centos7
COPY nvidia-driver /usr/local/bin
# diff nvidia-driver nvidia-driver.orig
27c27
< local version=$(yum -q list available --show-duplicates kernel-ml-headers |
---
> local version=$(yum -q list available --show-duplicates kernel-headers |
50,52c50,51
< echo "Installing Linux kernel ml headers..."
< rpm -e --nodeps kernel-headers
< yum -q -y install kernel-ml-headers-${KERNEL_VERSION} kernel-ml-devel-${KERNEL_VERSION} > /dev/null
---
> echo "Installing Linux kernel headers..."
> yum -q -y install kernel-headers-${KERNEL_VERSION} kernel-devel-${KERNEL_VERSION} > /dev/null
56c55
< curl -fsSL $(repoquery --location kernel-ml-${KERNEL_VERSION}) | rpm2cpio | cpio -idm --quiet
---
> curl -fsSL $(repoquery --location kernel-${KERNEL_VERSION}) | rpm2cpio | cpio -idm --quiet
390a390
>
After building the image, you have to manually replace the image tags in values.yml.
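Alternatively, a sketch of selecting the rebuilt image at install time, assuming the standard driver.repository/driver.image/driver.version Helm values of the gpu-operator chart (the operator typically appends the OS suffix, e.g. -centos7, to the tag):
helm install --wait --generate-name nvidia/gpu-operator \
  --set driver.repository=nvcr.io/nvidia \
  --set driver.image=mldriver \
  --set driver.version="460.32.03"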
#205
This also happens on gpu-operator v1.7.0.
Seeing this as well with CentOS 8, using containerd as opposed to Docker.
Hi folks, I got around this with dnf upgrade -y; looking through the script, it tries to match the host's kernel version with a version in the repo. I also had to add the repoConfig for the driver, as the environment has no internet access.
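For the kernel part, the host-side fix amounts to roughly the following (a sketch; package names assume a dnf-based system, and a reboot is needed to actually run the new kernel):
# on the GPU node, not inside the container
sudo dnf upgrade -y kernel kernel-headers kernel-devel
sudo reboot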
This strays from this issue, but I now get compilation errors.
We are trying to validate this internally and will try to fix it soon.
Hi everyone,
This issue is not a bug in the nvidia driver-container. The driver-container requires that the kernel-headers for the running kernel are present and can be accessed by the package manager (i.e. yum, dnf) inside the driver-container. The problem is that if you are running a kernel that is slightly out of date, meaning it is not the latest kernel version, the package manager will probably not be able to access the right kernel-headers by default, and the driver-container will therefore fail as above.
To avoid this issue, either 1) upgrade your running kernel, or 2) provide a custom repo configuration file for the driver container by configuring the driver.repoConfig option when deploying the gpu-operator. This solution isn't documented yet, but we will document these steps soon in our official docs.
Hi, I am having the same problem using CentOS 7 with the latest kernel: Linux 5.13.11-1.el7.elrepo.x86_64. We tried @purplepalmdash's solution but it did not work for us. Can you provide an example of a custom repo configuration file?
Thanks in advance.
Hi, I'm having the same issue using CentOS 8 with kernel 4.18.0-305.25.1.el8_4.x86_64.
rpm -qa | grep kernel-
kernel-tools-libs-4.18.0-305.25.1.el8_4.x86_64
kernel-headers-4.18.0-305.25.1.el8_4.x86_64
kernel-core-4.18.0-305.25.1.el8_4.x86_64
kernel-modules-4.18.0-305.25.1.el8_4.x86_64
kernel-tools-4.18.0-305.25.1.el8_4.x86_64
kernel-4.18.0-305.25.1.el8_4.x86_64
kernel-devel-4.18.0-305.25.1.el8_4.x86_64
kubectl logs nvidia-driver-daemonset-k7bxp -n gpu-operator-resources
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 470.57.02 for Linux kernel version 4.18.0-305.25.1.el8_4.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version. You likely have a mismatch between your running kernel and the kernel-headers on the repo. Please upgrade your Linux kernel to at least 4.18.0-305.25.1.el8_4.x86_64.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Any idea? Thanks.
I am on a CentOS 7.8 system, kernel version 3.10.0-1127. I replaced the image in nvidia-driver-daemonset with nvidia/driver:440.64.00-1.0.0-3.10.0-1127.el7.x86_64-centos7, and now all my pods are running fine:
Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (440.64.00):
Installing: [##############################] 100%
Driver file installation is complete.
Running post-install sanity check:
Checking: [##############################] 100%
Post-install sanity check passed.
Running runtime sanity check:
Checking: [##############################] 100%
Runtime sanity check passed.
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 440.64.00) is now complete.
Loading IPMI kernel module...
Loading NVIDIA driver kernel modules...
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/install-gpu-operator-outdated-kernels.html
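For anyone landing here, the page above covers the repoConfig approach. A minimal sketch of what it looks like, assuming the driver.repoConfig.configMapName / driver.repoConfig.destinationDir Helm values described there (value names may differ between operator versions, and the repo URL below is a placeholder):
# custom-repo.repo: a standard yum repo file pointing at a mirror that still
# carries kernel-headers/kernel-devel for the running kernel
cat > custom-repo.repo <<'EOF'
[custom-kernel-repo]
name=Custom kernel repo
baseurl=http://mirror.example.com/centos/7/updates/x86_64/
enabled=1
gpgcheck=0
EOF

# expose the repo file to the driver container as a ConfigMap
kubectl create configmap repo-config -n gpu-operator-resources --from-file=custom-repo.repo

# reference it when deploying the operator so it is mounted into /etc/yum.repos.d
helm install --wait --generate-name nvidia/gpu-operator \
  --set driver.repoConfig.configMapName=repo-config \
  --set driver.repoConfig.destinationDir=/etc/yum.repos.d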