
After the GPU node is restarted, the nvidia-driver-daemonset pod fails to start in an offline environment

Open sunwuyan opened this issue 10 months ago • 4 comments

After successfully integrating the GPU with gpu-operator, is there a way to avoid reinstalling the driver when the GPU node is restarted? My K8S cluster normally cannot access the public network, but every time the nvidia-driver-daemonset pod restarts, it needs a network connection to finish starting. Otherwise it reports the following error:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 5.15.0-67-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates InRelease' is not signed.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'http://archive.ubuntu.com/ubuntu focal-security InRelease' is not signed.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

I also tried setting driver.upgradePolicy.autoUpgrade to false, but that did not help either.
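
For anyone triaging a similar log, here is a minimal sketch (not part of gpu-operator) that pulls the failing repository URLs out of the installer output. It can help confirm that the pod is stuck at the "Updating the package cache" step and not somewhere else. The function names are hypothetical, and the match strings are assumptions based only on the log shown above.

```python
import re
from typing import List

# Matches apt error lines like the ones in the log above, e.g.
#   E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease ...
# (pattern is an assumption derived from that log, not from apt's source)
FAILED_FETCH = re.compile(r"E: Failed to fetch (\S+)")


def failed_fetch_urls(log_text: str) -> List[str]:
    """Return the URLs apt failed to fetch, de-duplicated, in order of appearance."""
    seen: List[str] = []
    for url in FAILED_FETCH.findall(log_text):
        if url not in seen:
            seen.append(url)
    return seen


def stuck_on_package_cache(log_text: str) -> bool:
    """Heuristic: the installer reached the cache update and apt reported fetch errors."""
    return "Updating the package cache" in log_text and bool(failed_fetch_urls(log_text))
```

The log text can be captured from the driver pod (for example with kubectl logs) and passed in as a single string. If stuck_on_package_cache returns True, the URLs returned by failed_fetch_urls are the repositories the container tried to reach from inside the offline cluster.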

sunwuyan, Apr 23 '24 06:04