GPU drivers not installing with host kernel 6.8 and vGPU 16.5 (535.161.05)

Open urbaman opened this issue 1 year ago • 7 comments

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-106-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd 1.6.28
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Kubeadm
  • GPU Operator Version: 23.9.2

2. Issue or feature description

Driver installation fails in a VM (guest kernel 5.15.0-106-generic) whose host runs kernel 6.8 with the vGPU 16.5 host driver (535.161.05).

3. Steps to reproduce the issue

Install vGPU 16.5 (535.161.05) on the host, then deploy gpu-operator on the VM.

4. Information to attach (optional if deemed irrelevant)

nvidia-driver-daemonset-k59mv logs:

Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-106-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using:           cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   83 | }
      | ^
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  695 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |     ^~~~~~
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-peermem/nvidia-peermem.c:490:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  490 |     int status = 0;
      |     ^~~
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.129.03-grid/kernel/Module.symvers] Error 1
make[2]: *** Deleting file '/usr/src/nvidia-535.129.03-grid/kernel/Module.symvers'
make[1]: *** [Makefile:1830: modules] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
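
For anyone triaging a similar log, here is a minimal sketch in Python that pulls out the three details that matter: the kernel the container builds against, the driver version actually being compiled, and any GPL-only symbol errors from modpost. The function name `summarize_driver_log` and the regular expressions are illustrative assumptions based only on the log lines quoted above; other driver or kernel versions may word these messages differently.

```python
import re

# Patterns are derived from the log lines quoted in this issue (an assumption,
# not an official log format).
KERNEL_RE = re.compile(r"Proceeding with Linux kernel version (?P<kernel>\S+)")
SRC_DIR_RE = re.compile(r"/usr/src/nvidia-(?P<version>\d+(?:\.\d+)+)")
MODPOST_RE = re.compile(
    r"ERROR: modpost: GPL-incompatible module (?P<module>\S+) "
    r"uses GPL-only symbol '(?P<symbol>[^']+)'"
)


def summarize_driver_log(log_text):
    """Return the kernel, driver version, and GPL-only symbol errors in a log."""
    kernel = KERNEL_RE.search(log_text)
    version = SRC_DIR_RE.search(log_text)
    errors = [(m["module"], m["symbol"]) for m in MODPOST_RE.finditer(log_text)]
    return {
        "kernel": kernel["kernel"] if kernel else None,
        "driver_version": version["version"] if version else None,
        "gpl_symbol_errors": errors,
    }


# Three lines copied from the log above.
sample = """Proceeding with Linux kernel version 5.15.0-106-generic
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
"""

print(summarize_driver_log(sample))
```

Note that the source directory in this log is nvidia-535.129.03-grid, not the 535.161.05 named in the title, so it may be worth checking which driver version the daemonset is actually building.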

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

urbaman avatar May 13 '24 12:05 urbaman

I have a similar issue. Attaching the crash report.

Azure VM: Linux 5.15.0-1063-azure x86_64
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

nvidia-dkms-515.0.crash.txt

kollachaitanyakrishna avatar May 13 '24 15:05 kollachaitanyakrishna

I encountered the same problem on Ubuntu 20.04 with nvidia-driver-535.171.04 and kernel 5.15.0-107-generic.

bqm1111 avatar May 16 '24 13:05 bqm1111

Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.

vicaya avatar May 18 '24 00:05 vicaya

> Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.

How can I install nvidia-driver-550 on ubuntu 20.04?

bqm1111 avatar May 19 '24 20:05 bqm1111

> > Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.
>
> How can I install nvidia-driver-550 on ubuntu 20.04?

Hi, did you solve your problem? I have the same one :(

Stephenfang51 avatar Jun 06 '24 05:06 Stephenfang51

Hi

You have to manually download the driver from this site.

bqm1111 avatar Jun 06 '24 09:06 bqm1111

> Hi
>
> You have to manually download the driver from this site.

Manual installation works for me!

2019211753 avatar Jun 14 '24 15:06 2019211753

The following error was fixed in the 535.183.08 driver:

ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'

Closing this issue.

cdesiniotis avatar Jul 11 '24 21:07 cdesiniotis
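
To check a given driver version against what this thread reports, a small comparison helper could look like the sketch below. It assumes only the two data points stated here: the fix landed in 535.183.08, and 550.x was reported as working. The function name `has_modpost_fix` is hypothetical, and branches the thread does not mention return `None` rather than a guess.

```python
def parse_version(version):
    """Turn '535.183.08' into (535, 183, 8) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


# Both thresholds come from comments in this thread, not from release notes.
FIXED_IN_535_BRANCH = parse_version("535.183.08")
REPORTED_WORKING_BRANCH = 550


def has_modpost_fix(version):
    """Return True or False when the thread gives an answer, otherwise None."""
    parsed = parse_version(version)
    major = parsed[0]
    if major == 535:
        return parsed >= FIXED_IN_535_BRANCH
    if major == REPORTED_WORKING_BRANCH:
        return True
    return None  # the thread says nothing about other driver branches


# Versions that appear in this thread.
for candidate in ("535.129.03", "535.161.05", "535.171.04", "535.183.08"):
    print(candidate, has_modpost_fix(candidate))
```

Comparing parsed tuples instead of raw strings avoids surprises with zero-padded components such as `08`.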