AKS node pool with scale-down mode Deallocate: GPU drivers can't be installed anymore after ~10 restarts

Open Johannesm299 opened this issue 1 year ago • 1 comments

1. Quick Debug Information

  • Cloud Provider: Azure AKS
  • OS: Ubuntu 22.04.4 LTS
  • Kernel Version: 5.15.0-1064-azure
  • Container Runtime: Containerd
  • K8s: AKS
  • GPU Operator Version: v23.9.0

Additional info: the scale-down mode for the GPU nodes is Deallocate, an Azure-specific setting that keeps the node's disk and reuses it the next time the node is scaled up.

2. Issue or feature description

The operator works fine initially, but after roughly 10 restarts of the GPU node it can no longer install the drivers. This seems to be related to the Deallocate scale-down mode of the AKS node pool; without that setting the problem doesn't occur.

3. Steps to reproduce the issue

  • Create an AKS cluster with a GPU node pool that has scale-down mode "Deallocate"
  • Install the GPU Operator via Helm with the following values:
operator:
  defaultRuntime: containerd
  logging:
    level: debug

dcgmExporter:
  enabled: false

driver:
  enabled: true
  version: "535.54.03"

toolkit:
  enabled: true
  • Have the GPU node scale up and down until the gpu-operator can't install the drivers anymore
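The steps above can be sketched as a script. This is only an illustration under assumptions: the resource group, cluster, pool name, and VM size are placeholders, the values from the report are assumed to be saved as `values.yaml`, the `gpu-operator` namespace is a guess, and manual `az aks nodepool scale` calls stand in for whatever scaling the reporter actually used. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
# All names below are placeholders (assumptions), not values from the report.
RG="${RG:-my-rg}"
CLUSTER="${CLUSTER:-my-aks}"
POOL="${POOL:-gpupool}"
VM_SIZE="${VM_SIZE:-Standard_NC6s_v3}"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. GPU node pool whose nodes are deallocated (disk kept) on scale-down.
run az aks nodepool add \
  --resource-group "$RG" --cluster-name "$CLUSTER" --name "$POOL" \
  --node-vm-size "$VM_SIZE" --node-count 1 \
  --node-osdisk-type Managed --scale-down-mode Deallocate

# 2. GPU Operator v23.9.0 with the values from the report (values.yaml).
run helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
run helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.9.0 -f values.yaml

# 3. One scale-down/scale-up cycle; the report needed roughly ten of these.
cycle_pool() {
  run az aks nodepool scale --resource-group "$RG" --cluster-name "$CLUSTER" \
    --name "$POOL" --node-count 0
  run az aks nodepool scale --resource-group "$RG" --cluster-name "$CLUSTER" \
    --name "$POOL" --node-count 1
}
cycle_pool
```

Calling `cycle_pool` repeatedly, and checking the driver daemonset pod after each cycle, would approximate the "scale up and down" step.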

nvidia-driver-daemonset pod debug log

Creating directory NVIDIA-Linux-x86_64-535.54.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.54.03...

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 535.54.03 for Linux kernel version 5.15.0-1064-azure

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-1064-azure
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
/usr/src/nvidia-535.54.03/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-535.54.03/kernel/nvidia-peermem/nvidia-peermem.c:462:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  462 |     int status = 0;
      |     ^~~
/usr/src/nvidia-535.54.03/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
/usr/src/nvidia-535.54.03/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  695 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |     ^~~~~~
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.54.03/kernel/Module.symvers] Error 1
make[2]: *** Deleting file '/usr/src/nvidia-535.54.03/kernel/Module.symvers'
make[1]: *** [Makefile:1830: modules] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Stopping NVIDIA persistence daemon...

Johannesm299 • Jun 03 '24 08:06
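To tell this failure apart from other driver build errors, the log can be scanned for the modpost line. A minimal sketch: both helpers read a log on stdin, so they can be fed a saved file or piped pod logs; the function names are made up for illustration.

```shell
# Exit 0 if the log on stdin contains the modpost GPL-symbol failure.
has_gpl_symbol_error() {
  grep -q "modpost: GPL-incompatible module .* uses GPL-only symbol"
}

# Print the offending symbol name(s) from the log on stdin, one per line.
gpl_symbols() {
  sed -n "s/.*uses GPL-only symbol '\([^']*\)'.*/\1/p"
}

# Example using the failing line from the log above.
line="ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'"
printf '%s\n' "$line" | gpl_symbols   # prints: rcu_read_unlock_strict
```

Against a live cluster, something like `kubectl logs -n gpu-operator <driver-pod> | has_gpl_symbol_error` should work, where the namespace and pod name depend on the installation.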

We are facing the exact same issue. Is there a known solution or workaround? Here is a similar issue: https://github.com/NVIDIA/gpu-operator/issues/718. For some people it started working after a driver upgrade, but that would also upgrade the CUDA version, which we do not want at the moment.

UPDATE: after upgrading the gpu-operator driver version to 535.183.01 the driver daemonset pod runs fine.

sanketnadkarni • Jul 01 '24 09:07

The error "ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'" is a known issue on newer kernels. It should be fixed with driver versions >= 535.183.08. Closing this issue.

cdesiniotis • Jul 11 '24 23:07
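The workaround reported in this thread, bumping the driver version, could be applied to an existing install roughly as follows. This is a sketch under assumptions: the release name, namespace, and `nvidia` repo alias are guesses about the installation, and the helper only prints the command so it can be reviewed first. The version used here is the one reported working above; the maintainer's comment points to >= 535.183.08.

```shell
# Assumed release name and namespace; adjust to the actual installation.
RELEASE="${RELEASE:-gpu-operator}"
NAMESPACE="${NAMESPACE:-gpu-operator}"

# Print a helm command that keeps the chart at v23.9.0 and the existing
# values, and overrides only driver.version with the given version.
upgrade_cmd() {
  echo helm upgrade "$RELEASE" nvidia/gpu-operator \
    --namespace "$NAMESPACE" --version v23.9.0 \
    --reuse-values --set "driver.version=$1"
}

upgrade_cmd 535.183.01
```

Running the printed command (or editing `driver.version` in the values file and upgrading with `-f`) should make the operator roll out a new driver daemonset with that version.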