
NOTICE: NVIDIA Driver Pods are failing due to CUDA linux repository GPG key rotation

Open shivamerla opened this issue 3 years ago • 9 comments

What happened:

The NVIDIA team rotated the GPG keys for the CUDA Linux repositories on 4/28. More information on this can be found here. The CUDA repository is included in the NVIDIA driver images deployed through the GPU Operator, which causes failures during apt-get update on Ubuntu 18.04 and Ubuntu 20.04. This happens whenever currently running driver containers are restarted or a node reboots. This does not impact RHCOS/RHEL or CentOS systems.

The following error message will be seen from the driver Pod (nvidia-driver-ctr container):

Updating the package cache...
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease' is no longer signed.
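
To confirm you are hitting this, check the driver container logs for the message above. A minimal sketch, assuming a default install (the namespace may be gpu-operator or gpu-operator-resources depending on the operator version, and app=nvidia-driver-daemonset is the label a default install applies to the driver pods):

# Tail the driver container logs on the driver daemonset pods
kubectl logs -n gpu-operator-resources -l app=nvidia-driver-daemonset -c nvidia-driver-ctr --tail=50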

Fix:

We have updated the driver images below for Ubuntu 18.04 and Ubuntu 20.04:

  • 510.47.03
  • 470.103.01
  • 450.172.01
  • 470-signed
  • 510-signed

To fetch the updated driver images, follow the steps below:

  • kubectl edit clusterpolicy
  • set driver.imagePullPolicy=Always

This will cause the newer image, which fixes the GPG key error, to be pulled.
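
If you prefer a non-interactive equivalent of the two steps above, something along these lines should work (a sketch; cluster-policy is the default ClusterPolicy resource name, and the namespace and label are assumptions for a default install, so verify them with kubectl get clusterpolicy and kubectl get pods first):

# Set the driver imagePullPolicy to Always on the ClusterPolicy
kubectl patch clusterpolicy/cluster-policy --type merge -p '{"spec":{"driver":{"imagePullPolicy":"Always"}}}'

# If the driver pod is not restarted automatically, delete it so the
# daemonset recreates it and re-pulls the image
kubectl delete pod -n gpu-operator-resources -l app=nvidia-driver-daemonset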

If you are using older driver versions, upgrading to one of these driver versions is recommended. Please open a support ticket or create an issue here if you cannot upgrade.

shivamerla avatar Apr 29 '22 20:04 shivamerla

Same issue on my end...

814HiManny avatar Apr 29 '22 22:04 814HiManny

The driver version I was using was 470.82.01. Using the list above, I set driver.version=470.103.01 and it fixed the error.

I'm not sure if this is the right way to update the driver, though. I would assume it should be done through Helm, as that is how I installed the GPU Operator.

gchazot avatar May 04 '22 17:05 gchazot

@gchazot It can also be done through helm upgrade with the same version of the chart, by changing the driver imagePullPolicy. Either approach gives the same result.
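
A sketch of what that could look like, assuming the chart was installed from the nvidia Helm repo with release name gpu-operator in namespace gpu-operator (substitute your own release name, namespace, and values):

# Re-apply the same chart version but force the driver image to be re-pulled
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values --set driver.imagePullPolicy=Always

# Alternatively, pin one of the rebuilt driver versions, e.g.
#   --set driver.version=470.103.01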

shivamerla avatar May 05 '22 22:05 shivamerla

Hi, I am getting this error even with the 470.103.01 driver. This version worked on another machine last week but didn't work on a new machine I tried to install on today.

I also tried the following (log output from the driver container shown below):

kubectl edit clusterpolicy
set driver.imagePullPolicy=Always
Starting installation of NVIDIA driver version 470.103.01 for Linux kernel version 5.13.0-39-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease' is not signed.

Update: I fixed this issue by deleting the image from containerd and forcing it to be pulled again.

fanminshi avatar May 16 '22 15:05 fanminshi

@fanminshi What do you mean by deleting the image from containerd? I am very new. Can you demonstrate or point to a guide? I am using Helm to deploy the GPU Operator. Does your fix also apply to Helm?

jinwonkim93 avatar May 17 '22 09:05 jinwonkim93

@jinwonkim93 I use only containerd to run k8s, so I used ctr -n k8s.io image delete <gpu-driver-container> to delete the locally cached image. This forces k8s to pull the new driver image containing the fix.
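
Roughly, the steps look like this (the image reference below is just an example; list the images first to find the exact name on your node, and you may need root on the node):

# Find the locally cached driver image reference
ctr -n k8s.io images ls | grep nvidia/driver

# Remove it so kubelet has to pull the image again
ctr -n k8s.io images rm nvcr.io/nvidia/driver:470.103.01-ubuntu20.04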

Regarding whether the same applies to Helm: if you update your GPU driver version in Helm, it should work because k8s will download the new image. If you didn't change the driver version in Helm, my fix should work because it forces the node to download the new container image with the fix.

fanminshi avatar May 17 '22 15:05 fanminshi

@fanminshi @jinwonkim93 Weird that imagePullPolicy=Always didn't update the image. Can you double-check, by describing the driver pod, that the new image is pulled?
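
For example, something like this should show the image, image ID, and pull policy in use (namespace and label are assumptions for a default install; adjust to yours):

# Inspect image, image ID and pull policy on the running driver pod
kubectl describe pod -n gpu-operator-resources -l app=nvidia-driver-daemonset | grep -iE 'image|pull'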

shivamerla avatar May 18 '22 01:05 shivamerla

@shivamerla Me neither.

Update: looks like the driver image I used wasn't updated.

jinwonkim93 avatar May 18 '22 05:05 jinwonkim93

@shivamerla It might be a user error on my side. I recall that I did

kubectl edit clusterpolicy
set driver.imagePullPolicy=Always

Then I checked the nvidia-driver-daemonset pod YAML and didn't see that imagePullPolicy got updated to Always, so I just tried the fix I mentioned before.

fanminshi avatar May 19 '22 21:05 fanminshi