gpu-operator Can not get cuda image from nvcr.io for stable version 1.6.2

Open xiaoduo opened this issue 2 years ago • 3 comments

We were running fine on stable 1.6.2 with the driver configured to version "450.80.02", but after the OCP cluster was upgraded to minor version 4.6.60, the nvidia-container-toolkit pods in GPU operator 1.6.2 can no longer pull their image:

Failed to pull image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": rpc error: code = Unknown desc = Error reading manifest sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 in nvcr.io/nvidia/cuda: manifest unknown: manifest unknown

Has this image been removed from nvcr.io? Which image should be used in this case?
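One way to check whether a pinned image still exists is to pull the reference out of the kubelet error and query the registry's manifest endpoint. The sketch below only covers the parsing side: it extracts the image from a "Failed to pull image" message, tells a digest-pinned reference from a tag-based one, and builds the manifest URL in the `/v2/<name>/manifests/<reference>` form defined by the OCI Distribution API. The helper names (`ImageRef`, `extract_image`, `parse_image_ref`, `manifest_url`) are hypothetical, and the code assumes the reference always starts with a registry host, as it does in the error above.

```python
import re
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    registry: str
    repository: str
    reference: str  # a tag or a digest such as "sha256:..."
    pinned: bool    # True when the reference is a digest


def extract_image(error_message: str) -> Optional[str]:
    """Return the image named in a 'Failed to pull image "..."' message, if any."""
    match = re.search(r'Failed to pull image "([^"]+)"', error_message)
    return match.group(1) if match else None


def parse_image_ref(image: str) -> ImageRef:
    """Split 'registry/repo@digest' or 'registry/repo:tag' into its parts.

    Assumption: the first path component is the registry host.
    """
    registry, _, rest = image.partition("/")
    if "@" in rest:
        repository, _, digest = rest.partition("@")
        return ImageRef(registry, repository, digest, True)
    repository, sep, tag = rest.rpartition(":")
    if not sep:
        # No tag given: container runtimes default to "latest".
        return ImageRef(registry, rest, "latest", False)
    return ImageRef(registry, repository, tag, False)


def manifest_url(ref: ImageRef) -> str:
    """Build the OCI Distribution manifest URL for the reference."""
    return f"https://{ref.registry}/v2/{ref.repository}/manifests/{ref.reference}"


if __name__ == "__main__":
    msg = ('Failed to pull image "nvcr.io/nvidia/cuda@sha256:'
           'ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": '
           'rpc error: code = Unknown desc = Error reading manifest')
    ref = parse_image_ref(extract_image(msg))
    print(ref.pinned, manifest_url(ref))
```

A HEAD or GET request against that URL would then show whether the manifest is still served; a registry that has dropped it answers with a "manifest unknown" error like the one in the pod event. The request itself is left out because registries typically require a bearer token first, and the exact token flow for nvcr.io is not covered here.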

Thanks!

Sep 20 '22 07:09 xiaoduo

@xiaoduo unfortunately those images were removed due to critical CVEs and have been republished under newer tags. v1.6.2 didn't support changing the CUDA base image and was pinned to a specific image digest. Support for changing it was added later, starting with 1.7.x (which supports OCP 4.6 in this case). The spec looks as here. Can you please upgrade to a later version to fix this?

Sep 20 '22 15:09 shivamerla

@shivamerla Thank you very much! We switched to 1.7.1 and it now works fine on OCP 4.6.60 with NFD 4.6. We will upgrade to 1.8.2 after OCP moves to 4.8.x.

BTW, do you know where we could subscribe to notifications for events such as image removal, so we can avoid this kind of issue in the future? Thanks in advance!

Sep 23 '22 08:09 xiaoduo

@xiaoduo in recent versions we have made the CUDA base image configurable through the ClusterPolicy instance, so hopefully you will not be blocked by removals like this again. I will update you once we know more about getting these notifications.
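As a rough illustration of what overriding the base image through ClusterPolicy could look like, the sketch below builds a JSON merge patch in Python. The field path `spec.operator.initContainer` with `repository`, `image`, and `version` keys is an assumption based on how later operator releases lay out their values; it is not confirmed in this thread and should be checked against the ClusterPolicy CRD of the installed version. The function name and the placeholder tag are hypothetical.

```python
import json


def init_container_patch(repository: str, image: str, version: str) -> dict:
    """Build a merge patch that overrides the CUDA base image.

    Assumption: the ClusterPolicy exposes the image under
    spec.operator.initContainer; verify this against your CRD.
    """
    return {
        "spec": {
            "operator": {
                "initContainer": {
                    "repository": repository,
                    "image": image,
                    "version": version,
                }
            }
        }
    }


if __name__ == "__main__":
    # "<supported-cuda-tag>" is a placeholder, not a real tag.
    patch = init_container_patch("nvcr.io/nvidia", "cuda", "<supported-cuda-tag>")
    print(json.dumps(patch, indent=2))
```

The printed JSON could then be passed to `oc patch clusterpolicy <name> --type merge -p '<json>'`, where `<name>` is the name of the ClusterPolicy instance in the cluster.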

Oct 17 '22 05:10 shivamerla
