Redeploy of Nvidia GPU operator fails upon upgrade to OpenShift 4.7.9
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [no] Are you running on an Ubuntu 18.04 node?
- [yes] Are you running Kubernetes v1.13+?
- [yes] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [no] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- [yes] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
- We found that the pods in the gpu-operator-resources namespace were stuck in the init state, and some pods were in CrashLoopBackOff.
- Checking the logs of a pod stuck in CrashLoopBackOff, we saw that the pod tries to determine the OpenShift version by querying the API at 172.30.0.1:443. We ran a debug pod and tried curl against 172.30.0.1:443, but the connection failed because of a proxy issue: the curl output showed that the NO_PROXY variable contained nothing about the service CIDR. The cluster-wide proxy object itself looked correct (see the sketch after this list).
- Checking the corresponding daemonset of those pods, we found that the env variables had not been picked up correctly. We updated the daemonset configuration by adding 172.30.0.1 to the env variables of that container spec, after which the clusterversion check against 172.30.0.1:443 succeeded. However, there were further errors related to the NVIDIA repos, which are application specific.
- We also checked the yaml of the other pods stuck in the init state and saw that init containers are specified. At the node level we checked the logs of those init containers with crictl logs, but those logs were again application specific.
- I suggested checking this problem with the NVIDIA vendor first, as most of the pods are stuck in the init phase with no confirmed reason/error. If NVIDIA says something needs to be checked on the OCP end, we can look at it once you provide their analysis and the exact error messages.
- Currently, as you are going to check with the NVIDIA vendor, I am keeping the case status as "waiting on customer". Please update us if you require help from the OpenShift end.
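A minimal sketch of the proxy checks described above, assuming the default Kubernetes service IP 172.30.0.1; the node name is a placeholder:

$ oc get proxy/cluster -o yaml                                  # inspect the cluster-wide proxy object (httpProxy/httpsProxy/noProxy)
$ oc debug node/<node-name>
sh-4.4# curl -kv https://172.30.0.1:443/version                 # probe the API service IP (honours any proxy env in the shell)
sh-4.4# curl -kv --noproxy '*' https://172.30.0.1:443/version   # same probe, explicitly bypassing any proxy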
2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: kubectl get pods --all-namespaces
- [x] kubernetes daemonset status: kubectl get ds --all-namespaces
  Attachments: allpods.txt (https://github.com/NVIDIA/gpu-operator/files/6593501/allpods.txt), allds.txt (https://github.com/NVIDIA/gpu-operator/files/6593500/allds.txt), nvidiapods.txt (https://github.com/NVIDIA/gpu-operator/files/6593502/nvidiapods.txt)
- [ ] If a pod/ds is in an error state or pending state: kubectl describe pod -n NAMESPACE POD_NAME
- [ ] If a pod/ds is in an error state or pending state: kubectl logs -n NAMESPACE POD_NAME
- [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo
- [ ] Docker configuration file: cat /etc/docker/daemon.json
- [ ] Docker runtime configuration: docker info | grep runtime
- [ ] NVIDIA shared directory: ls -la /run/nvidia
- [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
- [ ] NVIDIA driver directory: ls -la /run/nvidia/driver
- [ ] kubelet logs: journalctl -u kubelet > kubelet.logs
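For completeness, the checked items above can be collected with commands along these lines (output file names match the attachments; assuming nvidiapods.txt is the gpu-operator-resources pod listing):

$ oc get pods --all-namespaces > allpods.txt
$ oc get pods -n gpu-operator-resources -o wide > nvidiapods.txt
$ oc get ds --all-namespaces > allds.txt
$ journalctl -u kubelet > kubelet.logs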
Hello @damora,
What version of the GPU Operator is currently installed in your cluster? Only the latest release (1.7.0) supports OpenShift cluster upgrades; otherwise the GPU Operator must be uninstalled (ideally before the upgrade) and reinstalled afterwards.
The logs of the driver container would help us understand what is going wrong.
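If it helps, the installed operator version can be read from its ClusterServiceVersion (namespace as it appears later in this thread):

$ oc get csv -n openshift-operators | grep gpu-operator-certified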
Hmm, sorry, I missed the links to the different logs. It seems to be 1.7.0, as nvidia-operator-validator is there.
@damora The GPU Operator applies the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY variables from the cluster-wide proxy object when deploying the driver DaemonSet. Are you saying NO_PROXY was set correctly in the cluster-wide proxy but not in the driver DaemonSet? Were the other proxy env variables set correctly?
Can you paste the driver container logs indicating the failure to install kernel packages?
In the cluster-wide proxy object:
spec:
  httpProxy: http://10.0.5.50:3128
  httpsProxy: http://10.0.5.50:3128
  noProxy: foccluster.com,foc.foccluster.com
  trustedCA:
    name: ""
It is not set correctly in the driver DaemonSet. We added 172.30.0.1 since the driver was trying to use that IP address; I think this is a hack that ultimately will not work, but we were just experimenting.
Attachments: driver-daemonset.txt, driver-ds.txt, driverlogs.txt
env:
  - name: RHEL_VERSION
    value: "8.3"
  - name: OPENSHIFT_VERSION
    value: "4.7"
  - name: HTTPS_PROXY
    value: http://10.0.5.50:3128
  - name: https_proxy
    value: http://10.0.5.50:3128
  - name: HTTP_PROXY
    value: http://10.0.5.50:3128
  - name: http_proxy
    value: http://10.0.5.50:3128
  - name: NO_PROXY
    value: 172.30.0.1,foccluster.com,foc.foccluster.com
  - name: no_proxy
    value: 172.30.0.1,foccluster.com,foc.foccluster.com
  - name: RESOLVE_OCP_VERSION
    value: "true"
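For reference, a one-liner that should dump exactly those variables from the live DaemonSet (DaemonSet name taken from the driver pod name that appears later in this thread; adjust if yours differs):

$ oc -n gpu-operator-resources get ds nvidia-driver-daemonset \
    -o jsonpath='{.spec.template.spec.containers[0].env}'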
@kpouget
  kind: ClusterServiceVersion
  name: gpu-operator-certified.v1.7.0
  namespace: openshift-operators
- apiVersion: operators.coreos.com/v1
  kind: OperatorCondition
  name: gpu-operator-certified.v1.7.0
  namespace: openshift-operators
@damora We might need to add the Kubernetes service cluster IP to the NO_PROXY list in the cluster-wide proxy.
# oc get service kubernetes -n default
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 6h19m
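As a sketch, one way to do that is to merge the service IP into the existing noProxy entries shown earlier (double-check against your actual spec before applying):

$ oc patch proxy/cluster --type=merge \
    -p '{"spec":{"noProxy":"172.30.0.1,foccluster.com,foc.foccluster.com"}}'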
Regarding the second issue, can you double-check whether the cluster is properly entitled? https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html
$ cat << EOF >> mypod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cluster-entitled-build-pod
spec:
  containers:
    - name: cluster-entitled-build
      image: registry.access.redhat.com/ubi8:latest
      command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
  restartPolicy: Never
EOF
$ oc create -f mypod.yaml
$ oc logs cluster-entitled-build-pod -n default
The pod failed to pull the image:
Events:
  Type     Reason          Age                From               Message
  Normal   Scheduled       92s                default-scheduler  Successfully assigned gpu-operator-resources/cluster-entitled-build-pod to smicro06
  Normal   AddedInterface  91s                multus             Add eth0 [10.129.2.222/23]
  Normal   Pulling         52s (x3 over 90s)  kubelet            Pulling image "registry.access.redhat.com/ubi8:latest"
  Warning  Failed          52s (x3 over 90s)  kubelet            Failed to pull image "registry.access.redhat.com/ubi8:latest": rpc error: code = Unknown desc = error pinging docker registry registry.access.redhat.com: Get "https://registry.access.redhat.com/v2/": proxyconnect tcp: dial tcp 10.0.5.50:3128: connect: connection refused
  Warning  Failed          52s (x3 over 90s)  kubelet            Error: ErrImagePull
  Normal   BackOff         27s (x5 over 90s)  kubelet            Back-off pulling image "registry.access.redhat.com/ubi8:latest"
  Warning  Failed          27s (x5 over 90s)  kubelet            Error: ImagePullBackOff
I did verify that I can reach registry.access.redhat.com from the node that this pod is running on:
[damora@smicro02 ~]$ oc debug node/smicro06
Starting pod/smicro06-debug ...
To use host binaries, run chroot /host
Pod IP: 10.0.5.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ping registry.access.redhat.com
PING e40408.d.akamaiedge.net (96.6.42.35) 56(84) bytes of data.
64 bytes from a96-6-42-35.deploy.static.akamaitechnologies.com (96.6.42.35): icmp_seq=1 ttl=53 time=12.2 ms
64 bytes from a96-6-42-35.deploy.static.akamaitechnologies.com (96.6.42.35): icmp_seq=2 ttl=53 time=12.0 ms
64 bytes from a96-6-42-35.deploy.static.akamaitechnologies.com (96.6.42.35): icmp_seq=3 ttl=53 time=12.1 ms
64 bytes from a96-6-42-35.deploy.static.akamaitechnologies.com (96.6.42.35): icmp_seq=4 ttl=53 time=12.6 ms
I can also pull the image from the node:
[damora@smicro02 ~]$ oc debug node/smicro06
Starting pod/smicro06-debug ...
To use host binaries, run chroot /host
Pod IP: 10.0.5.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# podman pull registry.access.redhat.com/ubi8:latest
Trying to pull registry.access.redhat.com/ubi8:latest...
Getting image source signatures
Copying blob 053724d29990 done
Copying blob f0ae454850a7 done
Copying config 272209ff0a done
Writing manifest to image destination
Storing signatures
272209ff0ae5fe54c119b9c32a25887e13625c9035a1599feba654aa7638262d
sh-4.4#
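Since the kubelet error above is a connection refused from 10.0.5.50:3128, it may also be worth probing the proxy itself directly from the node; a sketch, reusing the host names and proxy address from this thread:

[damora@smicro02 ~]$ oc debug node/smicro06
sh-4.4# chroot /host
sh-4.4# curl -x http://10.0.5.50:3128 -I https://registry.access.redhat.com/v2/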
@damora If you have pulled this image manually, then you should be able to run with imagePullPolicy: IfNotPresent as set below. CRI-O should have been set up with the proxy env when the cluster-wide proxy was applied; not sure why it's not set up here.
$ cat << EOF >> mypod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cluster-entitled-build-pod
spec:
  containers:
    - name: cluster-entitled-build
      image: registry.access.redhat.com/ubi8:latest
      imagePullPolicy: IfNotPresent
      command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
      env:
        - name: HTTPS_PROXY
          value: http://10.0.5.50:3128
        - name: https_proxy
          value: http://10.0.5.50:3128
        - name: HTTP_PROXY
          value: http://10.0.5.50:3128
        - name: http_proxy
          value: http://10.0.5.50:3128
  restartPolicy: Never
EOF
$ oc create -f mypod.yaml
$ oc logs cluster-entitled-build-pod -n default
Same issue as before:
Events:
  Type     Reason          Age                From               Message
  Normal   Scheduled       51s                default-scheduler  Successfully assigned gpu-operator-resources/cluster-entitled-build-pod to smicro06
  Normal   AddedInterface  49s                multus             Add eth0 [10.129.2.224/23]
  Normal   Pulling         32s (x2 over 48s)  kubelet            Pulling image "registry.access.redhat.com/ubi8:latest"
  Warning  Failed          32s (x2 over 48s)  kubelet            Failed to pull image "registry.access.redhat.com/ubi8:latest": rpc error: code = Unknown desc = error pinging docker registry registry.access.redhat.com: Get "https://registry.access.redhat.com/v2/": proxyconnect tcp: dial tcp 10.0.5.50:3128: connect: connection refused
  Warning  Failed          32s (x2 over 48s)  kubelet            Error: ErrImagePull
  Normal   BackOff         20s (x3 over 47s)  kubelet            Back-off pulling image "registry.access.redhat.com/ubi8:latest"
  Warning  Failed          20s (x3 over 47s)  kubelet            Error: ImagePullBackOff
This is my yaml file:
apiVersion: v1
kind: Pod
metadata:
  name: cluster-entitled-build-pod
spec:
  containers:
    - name: cluster-entitled-build
      image: registry.access.redhat.com/ubi8:latest
      command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
      env:
        - name: HTTPS_PROXY
          value: http://10.0.5.50:3128
        - name: https_proxy
          value: http://10.0.5.50:3128
        - name: HTTP_PROXY
          value: http://10.0.5.50:3128
        - name: http_proxy
          value: http://10.0.5.50:3128
  restartPolicy: Never
@damora You will need to add imagePullPolicy: IfNotPresent, as I did in my comment above.
oc logs cluster-entitled-build-pod
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Red Hat Enterprise Linux 8 for x86_64 - BaseOS 0.0 B/s | 0 B 00:00
Errors during downloading metadata for repository 'rhel-8-for-x86_64-baseos-rpms':
  - Curl error (7): Couldn't connect to server for https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/repodata/repomd.xml [Failed to connect to 10.0.5.50 port 3128: Connection refused]
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
Describe pod:
Events:
  Type    Reason          Age    From               Message
  Normal  Scheduled       2m33s  default-scheduler  Successfully assigned gpu-operator-resources/cluster-entitled-build-pod to smicro06
  Normal  AddedInterface  2m32s  multus             Add eth0 [10.129.2.225/23]
  Normal  Pulled          2m32s  kubelet            Container image "registry.access.redhat.com/ubi8:latest" already present on machine
  Normal  Created         2m31s  kubelet            Created container cluster-entitled-build
  Normal  Started         2m31s  kubelet            Started container cluster-entitled-build
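The dnf error above is the same connection refused from 10.0.5.50:3128, this time from inside a pod. A quick way to check whether the proxy rejects connections coming from the pod network (as opposed to the host network) is a sketch like:

$ oc rsh -n <namespace> <any-running-pod>
$ curl -x http://10.0.5.50:3128 -I https://cdn.redhat.com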
@damora There seems to be a proxy issue going on here. Can you work with RH support to resolve it? That should allow the driver container to pull and install the necessary kernel-headers/kernel-devel packages.
Why did this start with 4.7.9? The GPU install was working before that, and nothing has changed regarding cluster DNS, DHCP, or HTTP servers.
The only change to the driver container in v1.7.0 is that it fetches the cluster version in order to add the correct OCP EUS repos. That required a change to the NO_PROXY variable in the cluster-wide proxy before deploying the operator. Now that this is fixed, it looks like the driver container is failing to actually reach any of the package repositories through the proxy. Given that CRI-O is not able to pull images either, there seems to be a proxy issue introduced with the 4.7.9 upgrade.
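To see which proxy environment CRI-O actually picked up on the node after the upgrade, something along these lines should work (node name taken from this thread):

$ oc debug node/smicro06
sh-4.4# chroot /host
sh-4.4# systemctl show crio --property=Environment    # should list the HTTP_PROXY/HTTPS_PROXY/NO_PROXY values pushed from the cluster-wide proxy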
@damora any further update on this? Were you able to verify entitlements and get driver install working?
Still trying to debug this. It appears that the running container cannot access nvcr.io/v2 via the proxy. If I log in to the proxy server and then curl nvcr.io, I can access it, but from within the running container I cannot.
I can log in to a worker node and execute podman pull nvcr.io/nvidia/driver with no issues. It just doesn't work from within the daemonset driver container. If I understand correctly, until the driver is installed the rest of the containers can't complete their initialization and the whole GPU operator fails.
The error we are seeing is:
state:
  waiting:
    message: 'rpc error: code = Unknown desc = error pinging docker registry nvcr.io: Get "https://nvcr.io/v2/": Method Not Allowed'
    reason: ErrImagePull
It seems to me it can't get to the registry from within the pod. This is not a proxy issue.
@damora The image pull happens before the container is started; from this error it looks like kubelet/CRI-O is unable to pull the image needed to launch the driver container.
Yes, that is the problem, but why? I can do a podman pull from the node. Shouldn't it work from the container?
@damora I think there is still some confusion here. The driver container makes no attempt to reach nvcr.io (our image registry) when it runs. You seem to be hitting the error before the driver container is launched at all, because the image pull itself is failing in CRI-O; its proxy config needs to be verified. Once the driver container starts, it will then pull the RPM packages required to install the driver. We don't appear to have reached that stage yet. Did you open a Red Hat bug for this?
@kpouget Can you help with this once the Bugzilla is shared?
@shivamerla The error is occurring in this container: nvidia-driver-daemonset-2t6fr 0/1 ImagePullBackOff 0 55m
So are you saying that it is not trying to pull the image to start this container?
the error is occurring in this container:
That's not correct :) The error occurs while CRI-O is trying to pull the image needed to start that container. That's what the line indicates.
crictl pull nvcr.io/v2/driver
FATA[0000] pulling image: rpc error: code = Unknown desc = error pinging docker registry nvcr.io: Get "https://nvcr.io/v2/": Method Not Allowed
How do I fix this? I think somewhere in this issue somebody suggested that it was a proxy problem, but I have verified that I can connect to the proxy server. It seems I can't connect to the driver registry, though.
Can you try crictl pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7
crictl pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7
FATA[0000] pulling image: rpc error: code = Unknown desc = error pinging docker registry nvcr.io: Get "https://nvcr.io/v2/": Method Not Allowed
Please open a ticket with RH and share it with us. They should be able to resolve this with CRI-O.
$ crictl pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7
Image is up to date for sha256:edbc81acfc959b87229ba22e09c8e8f4f2305d8e8c3d2f995ec5d7c40cd7024b
@shivamerla I did open a ticket with RH: 02949372. They are asking me, "Can you let me know if you have the credentials to access the registry nvcr.io/v2?" Why would I have credentials on this registry? I never needed them before.
You can let them know that these are public registries and no login is required: https://ngc.nvidia.com/catalog/containers/nvidia:driver
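For what it's worth, the registry ping that CRI-O performs can be approximated with curl through the same proxy; an anonymous GET of /v2/ on a public registry normally returns 401 with a WWW-Authenticate header, so a 405 Method Not Allowed here would point at something between the node and nvcr.io rather than at missing credentials (a sketch):

$ curl -x http://10.0.5.50:3128 -i https://nvcr.io/v2/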
cc: @zvonkok @kpouget for help with RH ticket: 02949372
Interesting, it's easy to reproduce on Fc34:
$ podman pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7
Trying to pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7...
Getting image source signatures
Copying blob 4b21dcdd136d skipped: already exists
Copying blob 55eda7743468 skipped: already exists
Copying blob ffda01e5a185 done
but
$ crictl pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7
FATA[0000] pulling image failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable
@kpouget is fc34 the build version of OpenShift or something else?
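(Side note: on a plain Fedora host, a "grpc: the connection is unavailable" failure from crictl usually just means it cannot reach a container runtime endpoint, not that the registry is unreachable; a quick check, assuming CRI-O is installed and running locally, is to point crictl at the CRI-O socket:)

$ sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock pull nvcr.io/nvidia/driver:460.73.01-rhcos4.7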