NOTICE: NVIDIA Driver Pods are failing due to CUDA Linux repository GPG key rotation
What happened:
The NVIDIA team rotated the GPG keys for the CUDA Linux repositories on 4/28. More information on this can be found here. The CUDA repository is included in the NVIDIA driver images deployed through the GPU Operator, and this is causing failures during apt-get update on Ubuntu 18.04 and Ubuntu 20.04. The failure happens whenever a currently running driver container is restarted or a node reboots. This does not impact RHCOS/RHEL or CentOS systems.
The following error message will be seen in the driver Pod (nvidia-driver-ctr container):
Updating the package cache...
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease' is no longer signed.
Fix:
We have updated the following driver images for Ubuntu 18.04 and Ubuntu 20.04:
- 510.47.03
- 470.103.01
- 450.172.01
- 470-signed
- 510-signed
To fetch the updated driver images, follow these steps:
- kubectl edit clusterpolicy
- set driver.imagePullPolicy=Always
This will cause the newer image, which fixes the GPG key error, to be pulled.
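If you prefer to apply the change non-interactively, a merge patch on the ClusterPolicy does the same thing. This is a sketch that assumes the resource is named cluster-policy, the default created by the operator chart; adjust the name if yours differs:

kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec": {"driver": {"imagePullPolicy": "Always"}}}'

Once the policy is updated, the driver daemonset pods will pull the fixed image the next time they are (re)created.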
If you are using an older driver version, upgrading to one of these driver versions is recommended. Please open a support ticket or create an issue here if you cannot upgrade.
Same issue on my end...
The driver version I was using was 470.82.01. Using the list above, I set driver.version=470.103.01, and that fixed the error.
I'm not sure this is the right way to update the driver, though. I would assume it should be done through Helm, as that is how I installed the GPU Operator.
@gchazot It can be done through helm upgrade as well, with the same version of the chart, by changing the driver imagePullPolicy. Either approach gives the same result.
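As a sketch, assuming the chart was installed as release gpu-operator in namespace gpu-operator from the nvidia Helm repo (adjust the release name, namespace, and repo alias to match your install), the equivalent upgrade would be:

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values \
  --set driver.imagePullPolicy=Always

You can also pin one of the fixed versions explicitly with --set driver.version=470.103.01 instead of (or in addition to) changing the pull policy.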
Hi, I am getting this error even with the 470.103.01 driver. This version worked on another machine last week but didn't work on a new machine I tried to install on today.
I also tried:
kubectl edit clusterpolicy
set driver.imagePullPolicy=Always
Starting installation of NVIDIA driver version 470.103.01 for Linux kernel version 5.13.0-39-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease' is not signed.
Update: I fixed this issue by deleting the image from containerd and forcing it to be pulled again.
@fanminshi What do you mean by deleting the image from containerd? I am very new; can you demonstrate or point to a guide? I am using Helm to deploy the GPU Operator. Does your fix also apply to Helm?
@jinwonkim93
I use only containerd to run k8s, so I used ctr -n k8s.io image delete <gpu-driver-container> to delete the locally cached image. This then forces k8s to pull the new driver image containing the fix (sketch below).
"I am using helm to deploy gpu operator. does your trial same applies to helm?" If you update your GPU driver version in Helm, it should work because k8s will download the new image. If you didn't change the driver version in Helm, my fix should work because it forces the new container image with the fix to be downloaded.
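To make the containerd steps concrete, here is roughly what I mean. The image reference, namespace, and pod label below are assumptions based on a default GPU Operator deployment, so replace them with what you actually see on your node:

# find the cached driver image on the node
ctr -n k8s.io images ls | grep nvidia/driver
# delete it (use the exact reference from the listing above)
ctr -n k8s.io images delete nvcr.io/nvidia/driver:470.103.01-ubuntu20.04
# delete the driver pod so it is recreated and the image is pulled again
kubectl delete pod -n gpu-operator-resources -l app=nvidia-driver-daemonset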
@fanminshi @jinwonkim93 It is strange that imagePullPolicy=Always didn't update the image. Can you double-check, by describing the driver pod, that the new image is being pulled?
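For example, something like this shows the image reference and the pull events (the namespace and label here are the defaults used by the operator; adjust if your deployment differs):

kubectl describe pod -n gpu-operator-resources -l app=nvidia-driver-daemonset | grep -iE 'image|pulled'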
@shivamerla Me neither.
Update: it looks like the driver image I was using hadn't been updated.
@shivamerla It might be a user error on my side. I recall that I did
kubectl edit clusterpolicy
set driver.imagePullPolicy=Always
and checked the nvidia-driver-daemonset pod YAML, but I didn't see imagePullPolicy get updated to Always. So I just tried the fix I mentioned before.
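For reference, I checked it with something like the following (the namespace and label are the defaults the operator uses; adjust if yours differ):

kubectl get pod -n gpu-operator-resources -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[0].spec.containers[0].imagePullPolicy}'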