
Toolkit DaemonSet stuck in init phase after upgrade

Open heilerich opened this issue 2 years ago • 7 comments

Symptoms

After upgrading the operator to v23.6.0, all deployments are stuck because the nvidia-container-toolkit-daemonset DaemonSet stays in the Init:0/1 phase indefinitely. The logs of the init stage indicate that the driver-validation container tries to run modprobe for the nvidia kernel module, which fails because the container does not ship with the kernel modules (they are loaded by the driver DaemonSet).
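For reference, this is roughly how we observed it (the namespace and pod name are examples from our setup; adjust to your installation):

kubectl -n gpu-operator get pods | grep nvidia-container-toolkit-daemonset
# pods stay in Init:0/1
kubectl -n gpu-operator logs nvidia-container-toolkit-daemonset-xxxxx -c driver-validation
# shows the failed modprobe from the driver-validation init container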

Issue

I believe the issue is caused by the fixes for #430 / #485 introduced in 84ef9b3. The new module for creating the char devices also has a function to load the kernel modules. With 8906259 (ping @elezar) this is explicitly activated here:

https://github.com/NVIDIA/gpu-operator/blob/25d6f8d0a6f12b020b0bbc4d60ab95ccab77f5d0/validator/main.go#L717

As mentioned above, I believe this cannot work in the validation container (unless I am missing something), and I also do not understand why it would be necessary, since loading the modules is handled by the driver DaemonSet.

Proposed solution

I think the code above in the validator's main.go should be set to devchar.WithLoadKernelModules(false). We have deployed a copy of the v23.6.0 container with this patch in our environment and everything seems to work fine.

diff --git a/validator/main.go b/validator/main.go
index 4742834d..9fee8103 100644
--- a/validator/main.go
+++ b/validator/main.go
@@ -714,7 +714,7 @@ func createDevCharSymlinks(isHostDriver bool, driverRoot string) error {
                devchar.WithDevCharPath(hostDevCharPath),
                devchar.WithCreateAll(true),
                devchar.WithCreateDeviceNodes(true),
-               devchar.WithLoadKernelModules(true),
+               devchar.WithLoadKernelModules(false),
        )

Related

Issue #552 might also be caused by this. There, v23.3.2 is used, which does not set devchar.WithLoadKernelModules explicitly, but I think the default value in that version might be true. I have not verified this, since we have never used v23.3.2.

heilerich avatar Aug 09 '23 18:08 heilerich

@cdesiniotis @shivamerla any thoughts on this? Would exposing this as a config option (possibly changing the default back to the previous value) make sense?

@heilerich do you have logs available that show the failure to load the kernel modules?

elezar avatar Aug 09 '23 20:08 elezar

The log output looks like this:

time="2023-08-09T20:58:41Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2023-08-09T20:58:41Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.122-flatcar\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

heilerich avatar Aug 09 '23 20:08 heilerich

We set the driverRoot to either /host or /run/nvidia/driver, which the nvidia-ctk library should chroot into to find the nvidia modules, so I am wondering why it is not able to find them.
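Conceptually, what the validator ends up doing via the library is roughly equivalent to the following (the exact invocation differs; this is only to illustrate the chroot):

# driverRoot is /host (pre-installed driver) or /run/nvidia/driver (driver container)
chroot /run/nvidia/driver modprobe nvidia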

@heilerich this was added for cases with a pre-installed driver, where all necessary modules might not be loaded when a GPU pod is scheduled or after a node reboot. With the driver container, we can skip this.

@elezar yes, it makes sense to add an option to disable this.

shivamerla avatar Aug 10 '23 02:08 shivamerla

@heilerich just to confirm in your case driver is pre-installed on the node with Flatcar Linux?

shivamerla avatar Aug 10 '23 02:08 shivamerla

@heilerich just to confirm in your case driver is pre-installed on the node with Flatcar Linux?

No, the driver is installed using the driver container.

we set the driverRoot to either /host or /run/nvidia/driver which the nvidia-ctk library should chroot to and find nvidia modules, wondering why it is not able to find them.

Ah, I totally missed the chroot. That must be it. The Flatcar driver container does not copy the modules to /lib/modules in its root filesystem, but loads them from /opt/nvidia/${DRIVER_VERSION} using modprobe -b. That is why the modprobe fails even when chrooting into the driver root. So basically this is a mismatch between the expectation of how the (maintained) driver containers should look and how the Flatcar driver container actually works (I realise the Flatcar container is not officially supported). So I think we can fix this by patching the driver container, of which we already maintain a fork because of various other issues.
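For the record, the rough idea of the driver container patch (a sketch only, untested as written; the module locations under /opt/nvidia/${DRIVER_VERSION} are specific to our Flatcar fork) is to make the modules resolvable under /lib/modules so a chrooted modprobe can find them:

# sketch only; adjust paths to the actual layout of the driver container
KERNEL_VERSION=$(uname -r)
mkdir -p /lib/modules/${KERNEL_VERSION}/kernel/drivers/video
find /opt/nvidia/${DRIVER_VERSION} -name '*.ko' -exec cp {} /lib/modules/${KERNEL_VERSION}/kernel/drivers/video/ \;
depmod ${KERNEL_VERSION}   # regenerate modules.dep so modprobe can resolve the nvidia modules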

@elezar yes, it makes sense to add an option to disable this.

I guess we can close this issue, unless you still want to add said option. I still think it would make sense; at least I would use it, since this step is clearly unnecessary in our environment and a potential source of problems.

I also want to note that there does not seem to be an option to change the log level for the validator command (unless I missed something again). That would also have been helpful here :-)

Anyway, I appreciate the quick response.

heilerich avatar Aug 10 '23 08:08 heilerich

@heilerich we will consider these enhancements. Thanks

shivamerla avatar Aug 29 '23 06:08 shivamerla

Any plans to take this up? We have a similar requirement, namely being able to disable some of the validation init containers.

chiragjn avatar Jan 13 '24 13:01 chiragjn