gpu-operator
After the GPU node is restarted, an error occurs when the nvidia-driver-daemonset pod is started in the offline environment
After successfully integrating the GPU with gpu-operator, is it possible to restart the GPU node without reinstalling the driver? My K8S cluster normally has no access to the public network, but every time the nvidia-driver-daemonset pod restarts, it needs network access to finish starting up. Otherwise it fails with the following error:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 5.15.0-67-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates InRelease' is not signed.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'http://archive.ubuntu.com/ubuntu focal-security InRelease' is not signed.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
I also tried setting driver.upgradePolicy.autoUpgrade to false, but that did not help either.
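
To make it easier to see which package sources the driver container tries to reach, here is a minimal sketch that pulls the failing repository URLs out of the installer output quoted above. The function name `failed_fetch_urls` and the `LOG` excerpt are only illustrative and are not part of gpu-operator; the sketch assumes the apt error lines keep the `E: Failed to fetch <url>` shape seen in the log.

```python
import re
from typing import List

# Excerpt of the installer output from this report (illustrative sample only).
LOG = (
    "Updating the package cache... "
    "E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed. "
    "E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease "
    "Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?) "
    "E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease "
    "Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?) "
    "E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease "
    "Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)"
)

# Assumption: each apt fetch failure is reported as "E: Failed to fetch <url> ...".
FETCH_FAILURE = re.compile(r"E: Failed to fetch (\S+)")


def failed_fetch_urls(log: str) -> List[str]:
    """Return the unique URLs apt could not fetch, in order of first appearance."""
    urls: List[str] = []
    for url in FETCH_FAILURE.findall(log):
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    for url in failed_fetch_urls(LOG):
        print(url)
```

In this log every failure is under http://archive.ubuntu.com/ubuntu (the focal, focal-updates and focal-security suites), so the pod is stuck at the "Updating the package cache" step, before any driver installation starts.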