CentOS 7: nvidia-driver pod "Could not resolve Linux kernel version"
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node? No, CentOS 7.8
- [x] Are you running Kubernetes v1.13+? v1.18
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Docker 20.10.3
- [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check below)
- [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
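A quick way to verify those two modules on a node (a generic check; empty output can also mean the module is built into the kernel rather than loaded as a module):
# run on the GPU node
lsmod | grep -E '^(i2c_core|ipmi_msghandler)'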
2. Issue or feature description
I get an error while the nvidia-driver pod tries to install the driver on CentOS 7. This is the log:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 3.10.0-862.el7.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Unable to open the file '/lib/modules/3.10.0-862.el7.x86_64/proc/version' (No such file or directory).Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
I see the same error in #97, but disabling nouveau as suggested there did not resolve it. I'm using gpu-operator v1.5.2. Please help me resolve this error, thanks.
Having the same problem. It also happens on gpu-operator v1.6, please help 🙏
To better understand what script is running, could you tell us which image the driver Pod is running?
kubectl get pods -n gpu-operator-resources -o=jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{", "}{end}{end}' | grep nvidia-driver-daemonset
I guess it should be executing this script: https://gitlab.com/nvidia/container-images/driver/-/blob/master/centos7/nvidia-driver
echo "Resolving Linux kernel version..."
if [ -z "${version}" ]; then
echo "Could not resolve Linux kernel version" >&2
return 1
fi
but the error message doesn't say so:
Resolving Linux kernel version...
Unable to open the file '/lib/modules/3.10.0-862.el7.x86_64/proc/version' (No such file or directory).
Could not resolve Linux kernel version
Hey @kpouget, the driver's image is nvidia-driver-daemonset-5tzbm: nvcr.io/nvidia/driver:460.32.03-centos7
Hi, I have the exact same issue:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 4.19.95-1.bplatform.el7.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Unable to open the file '/lib/modules/4.19.95-1.bplatform.el7.x86_64/proc/version' (No such file or directory).Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
The driver version I am using is exactly the same as the one @sahare92 is using:
nvidia-driver-daemonset-5b4mp: nvcr.io/nvidia/driver:460.32.03-centos7,
Would greatly appreciate help.
@hassanshabbirahmed @vietkute02 We will debug the issue with CentOS 7. Meanwhile, can you edit the driver daemonset to change the image to nvcr.io/nvidia/driver:450.80.02-rhel7.9 and verify whether this resolves it?
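For reference, one way to do that edit (a sketch assuming the default gpu-operator-resources namespace and that the daemonset's container is named nvidia-driver-ctr, which may differ between operator versions):
# point the driver daemonset at the suggested rhel7.9 image
kubectl -n gpu-operator-resources set image daemonset/nvidia-driver-daemonset \
  nvidia-driver-ctr=nvcr.io/nvidia/driver:450.80.02-rhel7.9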
Same issue, also happening on CentOS 8:
./helm install --wait --generate-name nvidia/gpu-operator --debug --set driver.version="450.80.02"
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 4.18.0-240.el8.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Unable to open the file '/lib/modules/4.18.0-240.el8.x86_64/proc/version' (No such file or directory).Could not resolve Linux kernel version. You likely have a mismatch between your running kernel and the kernel-headers on the repo. Please upgrade your Linux kernel to at least 4.18.0-240.el8.x86_64.
On my server, I run kernel 4.18.0 offline, but it seems that the script checks the kernel version against the online repo, which makes the version numbers mismatch:
local version=$(dnf -q list available --showduplicates kernel-headers |
awk -v arch=$(uname -m) 'NR>1 {print $2"."arch}' | tac | grep -E -m1 "^${KERNEL_VERSION/latest/.*}")
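You can reproduce that check by hand inside the driver container (a sketch using the same dnf query as above); if it prints nothing, the installer fails exactly like this:
# list the kernel-headers versions the repo offers and look for the running kernel
dnf -q list available --showduplicates kernel-headers \
  | awk -v arch=$(uname -m) 'NR>1 {print $2"."arch}' \
  | grep -F "$(uname -r)"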
My cluster runs on CentOS 7.6 with an upgraded kernel, 4.19.12-1.el7:
# rpm -qa | grep kernel-ml
kernel-ml-4.19.12-1.el7.elrepo.x86_64
Replacing kernel with kernel-ml in the nvidia-driver script and rebuilding the image worked for me; using the modified image nvcr.io/nvidia/mldriver:460.32.03-centos7, I got the nvidia-driver-daemonset working.
# docker build -t nvcr.io/nvidia/mldriver:460.32.03-centos7 .
# cat Dockerfile
FROM nvcr.io/nvidia/driver:460.32.03-centos7
COPY nvidia-driver /usr/local/bin
# diff nvidia-driver nvidia-driver.orig
27c27
< local version=$(yum -q list available --show-duplicates kernel-ml-headers |
---
> local version=$(yum -q list available --show-duplicates kernel-headers |
50,52c50,51
< echo "Installing Linux kernel ml headers..."
< rpm -e --nodeps kernel-headers
< yum -q -y install kernel-ml-headers-${KERNEL_VERSION} kernel-ml-devel-${KERNEL_VERSION} > /dev/null
---
> echo "Installing Linux kernel headers..."
> yum -q -y install kernel-headers-${KERNEL_VERSION} kernel-devel-${KERNEL_VERSION} > /dev/null
56c55
< curl -fsSL $(repoquery --location kernel-ml-${KERNEL_VERSION}) | rpm2cpio | cpio -idm --quiet
---
> curl -fsSL $(repoquery --location kernel-${KERNEL_VERSION}) | rpm2cpio | cpio -idm --quiet
390a390
>
After building the image, you have to manually replace the image tags in values.yml.
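Alternatively, a sketch of selecting the rebuilt image at install time, assuming the standard driver.repository/driver.image/driver.version Helm values of the gpu-operator chart (the operator typically appends the OS suffix, e.g. -centos7, to the tag):
helm install --wait --generate-name nvidia/gpu-operator \
  --set driver.repository=nvcr.io/nvidia \
  --set driver.image=mldriver \
  --set driver.version="460.32.03"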
#205
This also happens on gpu-operator v1.7.0.
Seeing this as well with CentOS 8, using containerd as opposed to Docker.
Hi folks, I got around this with dnf upgrade -y; looking through the script, it tries to match the host's kernel version with a version in the repo. I also had to add the repoConfig for the driver, as the environment has no internet access.
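For the kernel part, the host-side fix amounts to roughly the following (a sketch; package names assume a dnf-based system, and a reboot is needed to actually run the new kernel):
# on the GPU node, not inside the container
sudo dnf upgrade -y kernel kernel-headers kernel-devel
sudo reboot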
This strays from this issue, but I now get compilation errors.
We are trying to validate this internally and will try to fix it soon.
Hi everyone,
This issue is not a bug in the nvidia driver-container. The driver-container requires that the kernel-headers for the running kernel are present and can be accessed by the package manager (i.e. yum, dnf) inside the driver-container. The problem is that if you are running a kernel that is slightly out of date, meaning it is not the latest kernel version, the package manager will probably not be able to access the right kernel-headers by default, and the driver-container will therefore fail as above.
To avoid this issue, either 1) upgrade your running kernel, or 2) provide a custom repo configuration file for the driver container by configuring the driver.repoConfig option when deploying the gpu-operator. This solution isn't documented yet, but we will document these steps soon in our official docs.
Hi, I am having the same problem using CentOS 7 with the latest kernel: Linux 5.13.11-1.el7.elrepo.x86_64. We tried @purplepalmdash's solution but it did not work for us. Can you provide an example of a custom repo configuration file?
Thanks in advance.
Hi, I'm having the same issue using CentOS 8 with kernel 4.18.0-305.25.1.el8_4.x86_64.
rpm -qa | grep kernel-
kernel-tools-libs-4.18.0-305.25.1.el8_4.x86_64
kernel-headers-4.18.0-305.25.1.el8_4.x86_64
kernel-core-4.18.0-305.25.1.el8_4.x86_64
kernel-modules-4.18.0-305.25.1.el8_4.x86_64
kernel-tools-4.18.0-305.25.1.el8_4.x86_64
kernel-4.18.0-305.25.1.el8_4.x86_64
kernel-devel-4.18.0-305.25.1.el8_4.x86_64
kubectl logs nvidia-driver-daemonset-k7bxp -n gpu-operator-resources
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 470.57.02 for Linux kernel version 4.18.0-305.25.1.el8_4.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version. You likely have a mismatch between your running kernel and the kernel-headers on the repo. Please upgrade your Linux kernel to at least 4.18.0-305.25.1.el8_4.x86_64.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Any idea? Thanks.
I am on a CentOS 7.8 system, kernel version 3.10.0-1127. I replaced the image in nvidia-driver-daemonset with nvidia/driver:440.64.00-1.0.0-3.10.0-1127.el7.x86_64-centos7, and now all my pods are running fine:
Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (440.64.00):
Installing: [##############################] 100%
Driver file installation is complete.
Running post-install sanity check:
Checking: [##############################] 100%
Post-install sanity check passed.
Running runtime sanity check:
Checking: [##############################] 100%
Runtime sanity check passed.
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 440.64.00) is now complete.
Loading IPMI kernel module...
Loading NVIDIA driver kernel modules...
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/install-gpu-operator-outdated-kernels.html
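For anyone landing here, the page above covers the repoConfig approach. A minimal sketch of what it looks like, assuming the driver.repoConfig.configMapName / driver.repoConfig.destinationDir Helm values described there (value names may differ between operator versions, and the repo URL below is a placeholder):
# custom-repo.repo: a standard yum repo file pointing at a mirror that still
# carries kernel-headers/kernel-devel for the running kernel
cat > custom-repo.repo <<'EOF'
[custom-kernel-repo]
name=Custom kernel repo
baseurl=http://mirror.example.com/centos/7/updates/x86_64/
enabled=1
gpgcheck=0
EOF

# expose the repo file to the driver container as a ConfigMap
kubectl create configmap repo-config -n gpu-operator-resources --from-file=custom-repo.repo

# reference it when deploying the operator so it is mounted into /etc/yum.repos.d
helm install --wait --generate-name nvidia/gpu-operator \
  --set driver.repoConfig.configMapName=repo-config \
  --set driver.repoConfig.destinationDir=/etc/yum.repos.d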