
Redeploy of Nvidia GPU operator fails upon upgrade to OpenShift 4.7.9

Open damora opened this issue 4 years ago • 34 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [no] Are you running on an Ubuntu 18.04 node?
  • [yes] Are you running Kubernetes v1.13+?
  • [yes] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [no] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [yes] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

  • We found that the pods in the gpu-operator-resources namespace were stuck in the Init state, and some pods were in CrashLoopBackOff.
  • Checking the logs of a pod stuck in CrashLoopBackOff, we saw that it was trying to determine the OpenShift version by querying the API at 172.30.0.1:443. We ran a debug pod and tried curl against 172.30.0.1:443, but the connection failed because of a proxy issue: the curl output showed that the NO_PROXY variable did not contain the service CIDR, even though the cluster proxy object itself was correct.
  • After checking the corresponding DaemonSet of those pods, we found that the env variables had not been picked up correctly. We updated the DaemonSet configuration to add 172.30.0.1 to that container's env variables, after which the clusterversion check against 172.30.0.1:443 succeeded. However, there were then further errors related to the NVIDIA repo, which are application specific.
  • We checked the YAML of the other pods stuck in the Init state and found init containers specified there. At the node level we checked the logs of those init containers with crictl logs, but those logs were again specific to the application.
  • I suggested checking this problem with the NVIDIA vendor first, since most of the pods are stuck in the Init phase with no confirmed reason or error. If NVIDIA says something needs to be checked on the OCP side, we can look into it once you provide their analysis and the exact error messages.
  • Since you are going to check with the NVIDIA vendor, I am keeping the case status as "waiting on customer". Please update us if you require help from the OpenShift end.

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes daemonset status: kubectl get ds --all-namespaces

  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo

  • [ ] Docker configuration file: cat /etc/docker/daemon.json

  • [ ] Docker runtime configuration: docker info | grep runtime

  • [ ] NVIDIA shared directory: ls -la /run/nvidia

  • [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • [ ] NVIDIA driver directory: ls -la /run/nvidia/driver

  • [ ] kubelet logs journalctl -u kubelet > kubelet.logs

damora avatar Jun 03 '21 18:06 damora

Hello @damora,

What version of the GPU Operator is currently installed in your cluster? Only the latest one (1.7.0) supports OpenShift cluster upgrades; otherwise the GPU Operator must be uninstalled (ideally before the upgrade) and reinstalled afterwards.
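For reference, one way to check which version is installed (a sketch; this assumes the operator lives in the default openshift-operators namespace, adjust if yours differs):

$ oc get csv -n openshift-operators | grep gpu-operator-certified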

The logs of the driver container would help us understand what is going wrong.


kpouget avatar Jun 03 '21 18:06 kpouget

hum, sorry, I missed the links to the different logs. Seems to be 1.7.0 as nvidia-operator-validator is there

kpouget avatar Jun 03 '21 18:06 kpouget

@damora The GPU Operator applies the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY variables from the cluster-wide proxy object when deploying the driver DaemonSet. Do you see NO_PROXY set correctly in the cluster-wide proxy but not in the driver DaemonSet? Were the other proxy env variables set correctly?
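To compare the two, something along these lines should work (a sketch; resource names are the ones that appear elsewhere in this thread, adjust to your cluster):

$ oc get proxy/cluster -o jsonpath='{.spec}'
$ oc get ds nvidia-driver-daemonset -n gpu-operator-resources \
    -o jsonpath='{.spec.template.spec.containers[0].env}'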

Can you paste driver container logs indicating failure installing kernel packages?

shivamerla avatar Jun 03 '21 18:06 shivamerla

In the cluster-wide proxy object:

spec:
  httpProxy: http://10.0.5.50:3128
  httpsProxy: http://10.0.5.50:3128
  noProxy: foccluster.com,foc.foccluster.com
  trustedCA:
    name: ""

It is not set correctly in the driver DaemonSet. We added 172.30.0.1 since it was trying to use that ipaddr. I think this is a hack that ultimately will not work, but we were just experimenting.

driver-daemonset.txt driver-ds.txt driverlogs.txt

env:
  - name: RHEL_VERSION
    value: "8.3"
  - name: OPENSHIFT_VERSION
    value: "4.7"
  - name: HTTPS_PROXY
    value: http://10.0.5.50:3128
  - name: https_proxy
    value: http://10.0.5.50:3128
  - name: HTTP_PROXY
    value: http://10.0.5.50:3128
  - name: http_proxy
    value: http://10.0.5.50:3128
  - name: NO_PROXY
    value: 172.30.0.1,foccluster.com,foc.foccluster.com
  - name: no_proxy
    value: 172.30.0.1,foccluster.com,foc.foccluster.com
  - name: RESOLVE_OCP_VERSION
    value: "true"

damora avatar Jun 03 '21 19:06 damora

@kpouget
kind: ClusterServiceVersion
name: gpu-operator-certified.v1.7.0
namespace: openshift-operators
- apiVersion: operators.coreos.com/v1
  kind: OperatorCondition
  name: gpu-operator-certified.v1.7.0
  namespace: openshift-operators

damora avatar Jun 03 '21 19:06 damora

@damora we might need to add the kubernetes service cluster-ip to the NO_PROXY in cluster-wide proxy.

# oc get service kubernetes -n default
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   172.30.0.1   <none>        443/TCP   6h19m
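If that turns out to be the missing piece, a merge patch along these lines might work (a sketch only; it reuses the noProxy domains shown earlier in this thread, and you may prefer adding the full service CIDR, e.g. 172.30.0.0/16, instead of the single cluster IP):

$ oc patch proxy/cluster --type=merge \
    -p '{"spec":{"noProxy":"172.30.0.1,foccluster.com,foc.foccluster.com"}}'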

Regarding the second issue, can you double check if the cluster is properly entitled? https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html

$ cat << EOF >> mypod.yaml 
apiVersion: v1
kind: Pod
metadata:
 name: cluster-entitled-build-pod
spec:
 containers:
   - name: cluster-entitled-build
     image: registry.access.redhat.com/ubi8:latest
     command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
 restartPolicy: Never
EOF
$ oc create -f mypod.yaml
 

$ oc logs cluster-entitled-build-pod -n default

shivamerla avatar Jun 03 '21 20:06 shivamerla

The pod failed to pull the image. Events:

Type     Reason          Age                From               Message
----     ------          ----               ----               -------
Normal   Scheduled       92s                default-scheduler  Successfully assigned gpu-operator-resources/cluster-entitled-build-pod to smicro06
Normal   AddedInterface  91s                multus             Add eth0 [10.129.2.222/23]
Normal   Pulling         52s (x3 over 90s)  kubelet            Pulling image "registry.access.redhat.com/ubi8:latest"
Warning  Failed          52s (x3 over 90s)  kubelet            Failed to pull image "registry.access.redhat.com/ubi8:latest": rpc error: code = Unknown desc = error pinging docker registry registry.access.redhat.com: Get "https://registry.access.redhat.com/v2/": proxyconnect tcp: dial tcp 10.0.5.50:3128: connect: connection refused
Warning  Failed          52s (x3 over 90s)  kubelet            Error: ErrImagePull
Normal   BackOff         27s (x5 over 90s)  kubelet            Back-off pulling image "registry.access.redhat.com/ubi8:latest"
Warning  Failed          27s (x5 over 90s)  kubelet            Error: ImagePullBackOff

I did verify that I can reach registry.access.redhat.com from the node that this pod is running on:

[damora@smicro02 ~]$ oc debug node/smicro06
Starting pod/smicro06-debug ...
To use host binaries, run chroot /host
Pod IP: 10.0.5.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ping registry.access.redhat.com
PING e40408.d.akamaiedge.net (96.6.42.35) 56(84) bytes of data.
64 bytes from a96-6-42-35.deploy.static.akamaitechnologies.com (96.6.42.35): icmp_seq=1 ttl=53 time=12.2 ms
64 bytes from a96-6-42-35.deploy.static.akamaitechnologies.com (96.6.42.35): icmp_seq=2 ttl=53 time=12.0 ms
64 bytes from a96-6-42-35.deploy.static.akamaitechnologies.com (96.6.42.35): icmp_seq=3 ttl=53 time=12.1 ms
64 bytes from a96-6-42-35.deploy.static.akamaitechnologies.com (96.6.42.35): icmp_seq=4 ttl=53 time=12.6 ms

It can also pull the image from the node:

[damora@smicro02 ~]$ oc debug node/smicro06
Starting pod/smicro06-debug ...
To use host binaries, run chroot /host
Pod IP: 10.0.5.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# podman pull registry.access.redhat.com/ubi8:latest
Trying to pull registry.access.redhat.com/ubi8:latest...
Getting image source signatures
Copying blob 053724d29990 done
Copying blob f0ae454850a7 done
Copying config 272209ff0a done
Writing manifest to image destination
Storing signatures
272209ff0ae5fe54c119b9c32a25887e13625c9035a1599feba654aa7638262d
sh-4.4#

damora avatar Jun 03 '21 20:06 damora

@damora If you have pulled this image manually, then you should be able to run with imagePullPolicy: IfNotPresent as set below. CRI-O should have been set up with the proxy env when the cluster-wide proxy was applied; not sure why it isn't.

$ cat << EOF >> mypod.yaml 
apiVersion: v1
kind: Pod
metadata:
 name: cluster-entitled-build-pod
spec:
 containers:
   - name: cluster-entitled-build
     image: registry.access.redhat.com/ubi8:latest
     imagePullPolicy: IfNotPresent
     command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
     env:
      - name: HTTPS_PROXY
        value: http://10.0.5.50:3128
      - name: https_proxy
        value: http://10.0.5.50:3128
      - name: HTTP_PROXY
        value: http://10.0.5.50:3128
      - name: http_proxy
        value: http://10.0.5.50:3128
 restartPolicy: Never
EOF
$ oc create -f mypod.yaml
 

$ oc logs cluster-entitled-build-pod -n default
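As a side check (a sketch; node and image names are the ones used earlier in this thread), you can confirm the image is already present on the node so that IfNotPresent can use it:

$ oc debug node/smicro06 -- chroot /host crictl images | grep ubi8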

shivamerla avatar Jun 03 '21 21:06 shivamerla

Same issue as before. Events:

Type     Reason          Age                From               Message
----     ------          ----               ----               -------
Normal   Scheduled       51s                default-scheduler  Successfully assigned gpu-operator-resources/cluster-entitled-build-pod to smicro06
Normal   AddedInterface  49s                multus             Add eth0 [10.129.2.224/23]
Normal   Pulling         32s (x2 over 48s)  kubelet            Pulling image "registry.access.redhat.com/ubi8:latest"
Warning  Failed          32s (x2 over 48s)  kubelet            Failed to pull image "registry.access.redhat.com/ubi8:latest": rpc error: code = Unknown desc = error pinging docker registry registry.access.redhat.com: Get "https://registry.access.redhat.com/v2/": proxyconnect tcp: dial tcp 10.0.5.50:3128: connect: connection refused
Warning  Failed          32s (x2 over 48s)  kubelet            Error: ErrImagePull
Normal   BackOff         20s (x3 over 47s)  kubelet            Back-off pulling image "registry.access.redhat.com/ubi8:latest"
Warning  Failed          20s (x3 over 47s)  kubelet            Error: ImagePullBackOff

This is my yaml file:

apiVersion: v1
kind: Pod
metadata:
 name: cluster-entitled-build-pod
spec:
 containers:
   - name: cluster-entitled-build
     image: registry.access.redhat.com/ubi8:latest
     command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
     env:
      - name: HTTPS_PROXY
        value: http://10.0.5.50:3128
      - name: https_proxy
        value: http://10.0.5.50:3128
      - name: HTTP_PROXY
        value: http://10.0.5.50:3128
      - name: http_proxy
        value: http://10.0.5.50:3128
 restartPolicy: Never

damora avatar Jun 04 '21 13:06 damora

@damora you will need to add imagePullPolicy: IfNotPresent as I added in my comment above.

shivamerla avatar Jun 04 '21 15:06 shivamerla

$ oc logs cluster-entitled-build-pod
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Red Hat Enterprise Linux 8 for x86_64 - BaseOS   0.0 B/s | 0 B   00:00
Errors during downloading metadata for repository 'rhel-8-for-x86_64-baseos-rpms':
  - Curl error (7): Couldn't connect to server for https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/repodata/repomd.xml [Failed to connect to 10.0.5.50 port 3128: Connection refused]
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

Describe pod. Events:

Type    Reason          Age    From               Message
----    ------          ----   ----               -------
Normal  Scheduled       2m33s  default-scheduler  Successfully assigned gpu-operator-resources/cluster-entitled-build-pod to smicro06
Normal  AddedInterface  2m32s  multus             Add eth0 [10.129.2.225/23]
Normal  Pulled          2m32s  kubelet            Container image "registry.access.redhat.com/ubi8:latest" already present on machine
Normal  Created         2m31s  kubelet            Created container cluster-entitled-build
Normal  Started         2m31s  kubelet            Started container cluster-entitled-build

damora avatar Jun 04 '21 17:06 damora

@damora There seems to be a proxy issue going on here. Can you work with RH support to resolve it? That should allow the driver container to pull/install the necessary kernel-headers/kernel-devel packages.

shivamerla avatar Jun 04 '21 17:06 shivamerla

Why did this start with 4.7.9? The GPU install was working before that, and nothing has changed regarding the cluster DNS, DHCP, or HTTP servers.

damora avatar Jun 04 '21 17:06 damora

The only change to the driver container in v1.7.0 is that it fetches the cluster version to add the correct OCP EUS repos. This required a change to the NO_PROXY variable in the cluster-wide proxy before deploying the operator. Now that this is fixed, it looks like it is failing to actually reach any of the package repositories through the proxy. Given that CRI-O is not able to pull images either, there seems to be a proxy issue introduced with the 4.7.9 upgrade.
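To see what proxy settings CRI-O is actually running with on a node, something like this can help (a sketch; the node name is taken from earlier in this thread, and the exact drop-in files vary by OCP version):

$ oc debug node/smicro06
sh-4.4# chroot /host
sh-4.4# systemctl cat crio    # look for Environment/EnvironmentFile entries carrying the proxy settings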

shivamerla avatar Jun 04 '21 18:06 shivamerla

@damora any further update on this? Were you able to verify entitlements and get driver install working?

shivamerla avatar Jun 18 '21 14:06 shivamerla

@damora any further update on this? Were you able to verify entitlements and get driver install working?

Still trying to debug this. It appears that the running container cannot access nvcr.io/v2 via the proxy. If I log in to the proxy server and then curl nvcr.io, I can reach it, but from within the running container I cannot.

I can log in to a worker node and execute: podman pull nvcr.io/nvidia/driver

No issues when doing that, just not from within the DaemonSet driver container. If I understand correctly, until the driver is installed the rest of the containers can't complete their initialization and the whole GPU operator fails.

damora avatar Jun 18 '21 15:06 damora

The error we are seeing is:
state:
  waiting:
    message: 'rpc error: code = Unknown desc = error pinging docker registry nvcr.io: Get "https://nvcr.io/v2/": Method Not Allowed'
    reason: ErrImagePull

Seems to me it can't get to the registry from within the pod. This is not a proxy issue

damora avatar Jun 24 '21 18:06 damora

@damora The image pull happens before the container is started; with this error it looks like kubelet/CRI-O is unable to pull the image needed to launch the driver container.

shivamerla avatar Jun 24 '21 18:06 shivamerla

@damora The image pull happens before the container is started; with this error it looks like kubelet/CRI-O is unable to pull the image needed to launch the driver container.

Yes, that is the problem, but why? I can do a podman pull from the node. Shouldn't it work from the container?

damora avatar Jun 24 '21 18:06 damora

@damora I think there is still confusion here. Within the driver container, no attempt is made to reach nvcr.io (our image registry) when it runs. Here you seem to hit the error before the driver container is even launched, because the image pull by CRI-O is failing. Its proxy config needs to be verified. Once the driver container starts, it will then pull the RPM packages required to install the driver; we don't seem to have reached that stage yet. Did you open a Red Hat bug for this?
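In other words, there are two separate network paths to check (a sketch; node, pod, and image names are the ones used elsewhere in this thread):

# 1) Image pull: done by CRI-O on the node, using the node-level proxy config
$ oc debug node/smicro06 -- chroot /host crictl pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7

# 2) Driver RPM install: done inside the driver container, using the DaemonSet proxy env
#    (only possible once the container actually starts)
$ oc -n gpu-operator-resources exec nvidia-driver-daemonset-2t6fr -- env | grep -i proxy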

@kpouget can you help with this once bugzilla is shared?

shivamerla avatar Jun 24 '21 18:06 shivamerla

@shivamerla The error is occurring in this container:

nvidia-driver-daemonset-2t6fr   0/1   ImagePullBackOff   0   55m

so are you saying that it is not trying to pull the image to start this container?

damora avatar Jun 24 '21 18:06 damora

The error is occurring in this container:

That's not correct :) The error occurs when CRI-O tries to pull the image used to start that container. That's what the line indicates.

shivamerla avatar Jun 24 '21 18:06 shivamerla

The error is occurring in this container:

That's not correct :) The error occurs when CRI-O tries to pull the image used to start that container. That's what the line indicates.

crictl pull nvcr.io/v2/driver
FATA[0000] pulling image: rpc error: code = Unknown desc = error pinging docker registry nvcr.io: Get "https://nvcr.io/v2/": Method Not Allowed

How do I fix this? I think somewhere in this issue somebody suggested that it was a proxy problem, but I have verified that I can connect to the proxy server. It seems I can't connect to the driver registry, though.

damora avatar Jun 24 '21 19:06 damora

Can you try crictl pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7

shivamerla avatar Jun 24 '21 19:06 shivamerla

crictl pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7
FATA[0000] pulling image: rpc error: code = Unknown desc = error pinging docker registry nvcr.io: Get "https://nvcr.io/v2/": Method Not Allowed

damora avatar Jun 24 '21 19:06 damora

Please open a ticket with RH and share it with us. They should be able to resolve this with CRI-O.

$ crictl pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7
Image is up to date for sha256:edbc81acfc959b87229ba22e09c8e8f4f2305d8e8c3d2f995ec5d7c40cd7024b

shivamerla avatar Jun 24 '21 19:06 shivamerla

@shivamerla I did open a ticket with RH - 02949372. They are asking me: "Can you let me know if you have the credentials to access the registry nvcr.io/v2?" Why would I need credentials for this registry? I never needed them before.

damora avatar Jun 24 '21 19:06 damora

You can let them know that these are public registries and no login is required: https://ngc.nvidia.com/catalog/containers/nvidia:driver
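To demonstrate that anonymous access works from any machine with registry access (a sketch, assuming skopeo is available; the image tag is the one used elsewhere in this thread):

$ skopeo inspect docker://nvcr.io/nvidia/driver:460.73.01-rhcos4.7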

cc: @zvonkok @kpouget for help with RH ticket: 02949372

shivamerla avatar Jun 24 '21 19:06 shivamerla

Interesting, it's easy to reproduce on fc34:

$ podman pull  nvcr.io/nvidia/driver:460.73.01-rhcos4.7
Trying to pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7...
Getting image source signatures
Copying blob 4b21dcdd136d skipped: already exists
Copying blob 55eda7743468 skipped: already exists
Copying blob ffda01e5a185 done

but

$ crictl pull  nvcr.io/nvidia/driver:460.73.01-rhcos4.7
FATA[0000] pulling image failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable

kpouget avatar Jun 24 '21 20:06 kpouget

@kpouget is fc34 the build version of OpenShift or something else?

damora avatar Jun 25 '21 12:06 damora