gpu-operator

Cannot get CUDA image from nvcr.io for stable version 1.6.2

Open xiaoduo opened this issue 2 years ago • 3 comments

We were running fine on stable 1.6.2 with the driver version configured as "450.80.02", but after the OCP cluster was upgraded to minor version 4.6.60, the nvidia-container-toolkit pods in GPU Operator 1.6.2 can no longer pull their image:

Failed to pull image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59": rpc error: code = Unknown desc = Error reading manifest sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 in nvcr.io/nvidia/cuda: manifest unknown: manifest unknown

Has this image been removed from nvcr.io? Which image should be used in this case?

Thanks!

xiaoduo • Sep 20 '22 07:09

@xiaoduo unfortunately those images were removed due to critical CVEs and have been updated with newer tags. v1.6.2 did not support changing the CUDA base image and was pinned to a specific image digest. Support for changing it was added later, starting with 1.7.x (which supports OCP 4.6 in this case). The spec looks as here. Can you please upgrade to a later version to fix this?
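To illustrate why a digest pin breaks when an image is republished under a new tag, here is a minimal Python sketch that tells a digest-pinned reference apart from a tag-based one. The names `ImageRef`, `parse_image_ref`, and `is_digest_pinned` are hypothetical helpers written for this example, not part of the GPU Operator or any registry client, and the parser only handles the simple `name[:tag][@sha256:...]` shape.

```python
import re
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    repository: str          # e.g. "nvcr.io/nvidia/cuda"
    tag: Optional[str]       # mutable pointer, may be re-pushed
    digest: Optional[str]    # immutable content address


# Assumption: only sha256 digests are handled in this sketch.
_DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def parse_image_ref(ref: str) -> ImageRef:
    """Split an image reference into repository, tag, and digest."""
    name, digest = ref, None
    if "@" in ref:
        name, digest = ref.split("@", 1)
        if not _DIGEST_RE.match(digest):
            raise ValueError(f"unsupported digest: {digest!r}")
    tag = None
    # A colon after the last slash separates the tag;
    # a colon before it belongs to a registry port.
    colon = name.rfind(":")
    if colon > name.rfind("/"):
        name, tag = name[:colon], name[colon + 1:]
    return ImageRef(name, tag, digest)


def is_digest_pinned(ref: str) -> bool:
    """True if the reference can only resolve to one exact manifest."""
    return parse_image_ref(ref).digest is not None


if __name__ == "__main__":
    pinned = ("nvcr.io/nvidia/cuda@sha256:"
              "ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59")
    print(parse_image_ref(pinned).repository, is_digest_pinned(pinned))
```

A digest-pinned reference such as the one in the error above resolves to exactly one manifest, so once that manifest is deleted from the registry the pull fails with "manifest unknown" even if a patched image exists under a newer tag.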

shivamerla • Sep 20 '22 15:09

@shivamerla Thank you very much! We switched to 1.7.1 and it now works fine on OCP 4.6.60 with NFD 4.6. We will upgrade to 1.8.2 after OCP moves to 4.8.x.

BTW, do you know where we could subscribe to notifications for events such as image removal, so we can avoid this kind of issue in the future? Thanks in advance!

xiaoduo • Sep 23 '22 08:09

@xiaoduo in recent versions we have made the CUDA base image configurable through the ClusterPolicy instance, so hopefully you will not be blocked by removals like this again. I will update you once I know more about getting these notifications.

shivamerla • Oct 17 '22 05:10