
gpu-operator pod in CrashLoopBackOff

Open smithbk opened this issue 2 years ago • 14 comments

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [ ] Are you running Kubernetes v1.13+?
  • [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
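
For the last item, a quick way to verify that the CRD is actually installed (the CRD name below is the one shipped with the operator):

$ kubectl get crd clusterpolicies.nvidia.com
$ kubectl get clusterpolicies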

1. Issue or feature description

The gpu operator pod is in CrashLoopBackOff.

NOTE: This is a follow-up to https://github.com/NVIDIA/gpu-operator/issues/330.

2. Steps to reproduce the issue

I am on OpenShift 4.6.26 and am trying to install the NVIDIA GPU Operator v1.7.1 via the console.

3. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods --all-namespaces

  • [ ] kubernetes daemonset status: kubectl get ds --all-namespaces

  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo

  • [ ] Docker configuration file: cat /etc/docker/daemon.json

  • [ ] Docker runtime configuration: docker info | grep runtime

  • [ ] NVIDIA shared directory: ls -la /run/nvidia

  • [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • [ ] NVIDIA driver directory: ls -la /run/nvidia/driver

  • [ ] kubelet logs journalctl -u kubelet > kubelet.logs
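
A minimal sketch for gathering the items above into one place (the namespace and pod name are taken from this cluster; substitute oc for kubectl on OpenShift, and run the journalctl command on the node itself):

$ mkdir -p gpu-debug
$ kubectl get pods --all-namespaces > gpu-debug/pods.txt
$ kubectl get ds --all-namespaces > gpu-debug/daemonsets.txt
$ kubectl describe pod -n openshift-operators gpu-operator-566644fc46-2znxj > gpu-debug/operator-describe.txt
$ kubectl logs -n openshift-operators gpu-operator-566644fc46-2znxj --previous > gpu-debug/operator-previous.log
$ journalctl -u kubelet > gpu-debug/kubelet.logs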

The following shows the state of the gpu-operator pod and its logs.

$ oc get pod gpu-operator-566644fc46-2znxj
NAME                            READY   STATUS    RESTARTS   AGE
gpu-operator-566644fc46-2znxj   1/1     Running   5          6m16s
$ oc get pod gpu-operator-566644fc46-2znxj
NAME                            READY   STATUS      RESTARTS   AGE
gpu-operator-566644fc46-2znxj   0/1     OOMKilled   5          6m27s
$ oc get pod gpu-operator-566644fc46-2znxj
NAME                            READY   STATUS             RESTARTS   AGE
gpu-operator-566644fc46-2znxj   0/1     CrashLoopBackOff   5          6m31s
$ oc logs gpu-operator-566644fc46-2znxj -f
I0405 14:09:13.490124       1 request.go:655] Throttling request took 1.043831213s, request: GET:https://172.23.0.1:443/apis/operator.ibm.com/v1?timeout=32s
2022-04-05T14:09:21.300Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2022-04-05T14:09:21.301Z	INFO	controller-runtime.injectors-warning	Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z	INFO	controller-runtime.injectors-warning	Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z	INFO	controller-runtime.injectors-warning	Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z	INFO	controller-runtime.injectors-warning	Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z	INFO	controller-runtime.injectors-warning	Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z	INFO	controller-runtime.injectors-warning	Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z	INFO	controller-runtime.injectors-warning	Injectors are deprecated, and will be removed in v0.10.x
2022-04-05T14:09:21.301Z	INFO	setup	starting manager
I0405 14:09:21.301793       1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/53822513.nvidia.com...
2022-04-05T14:09:21.301Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
I0405 14:09:38.742220       1 leaderelection.go:253] successfully acquired lease openshift-operators/53822513.nvidia.com
2022-04-05T14:09:38.742Z	INFO	controller-runtime.manager.controller.clusterpolicy-controller	Starting EventSource	{"source": "kind source: /, Kind="}
2022-04-05T14:09:38.742Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"ConfigMap","namespace":"openshift-operators","name":"53822513.nvidia.com","uid":"20e58758-fe21-40f9-80b7-3f7d24ecea7e","apiVersion":"v1","resourceVersion":"2508020210"}, "reason": "LeaderElection", "message": "gpu-operator-566644fc46-2znxj_81321000-656f-4d46-bb25-09f9cd573143 became leader"}
2022-04-05T14:09:38.742Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Lease","namespace":"openshift-operators","name":"53822513.nvidia.com","uid":"1f1a391c-afac-4452-b669-3543e388e16f","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2508020211"}, "reason": "LeaderElection", "message": "gpu-operator-566644fc46-2znxj_81321000-656f-4d46-bb25-09f9cd573143 became leader"}
2022-04-05T14:09:38.843Z	INFO	controller-runtime.manager.controller.clusterpolicy-controller	Starting EventSource	{"source": "kind source: /, Kind="}
2022-04-05T14:09:38.943Z	INFO	controller-runtime.manager.controller.clusterpolicy-controller	Starting EventSource	{"source": "kind source: /, Kind="}
2022-04-05T14:09:39.949Z	INFO	controller-runtime.manager.controller.clusterpolicy-controller	Starting Controller
2022-04-05T14:09:39.949Z	INFO	controller-runtime.manager.controller.clusterpolicy-controller	Starting workers	{"worker count": 1}
2022-04-05T14:09:39.955Z	INFO	controllers.ClusterPolicy	Getting assets from: 	{"path:": "/opt/gpu-operator/pre-requisites"}
2022-04-05T14:09:39.956Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Namespace", "in path:": "/opt/gpu-operator/pre-requisites"}
2022-04-05T14:09:39.956Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RuntimeClass", "in path:": "/opt/gpu-operator/pre-requisites"}
2022-04-05T14:09:39.957Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "PodSecurityPolicy", "in path:": "/opt/gpu-operator/pre-requisites"}
2022-04-05T14:09:39.959Z	INFO	controllers.ClusterPolicy	Getting assets from: 	{"path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.959Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.960Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Role", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.961Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.961Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.962Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.963Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ConfigMap", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.963Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.964Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-driver"}
2022-04-05T14:09:39.971Z	INFO	controllers.ClusterPolicy	Getting assets from: 	{"path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.971Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.971Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Role", "in path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.971Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.972Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-container-toolkit"}
2022-04-05T14:09:39.972Z	INFO	controllers.ClusterPolicy	Getting assets from: 	{"path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Role", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.973Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.974Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.974Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-operator-validation"}
2022-04-05T14:09:39.975Z	INFO	controllers.ClusterPolicy	Getting assets from: 	{"path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.975Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.976Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Role", "in path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.976Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.976Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-device-plugin"}
2022-04-05T14:09:39.977Z	INFO	controllers.ClusterPolicy	Getting assets from: 	{"path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.977Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.978Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Role", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.978Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.978Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Role", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.978Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.979Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Service", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.980Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ServiceMonitor", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.982Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ConfigMap", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.982Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.983Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-monitoring"}
2022-04-05T14:09:39.983Z	INFO	controllers.ClusterPolicy	Getting assets from: 	{"path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.984Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.984Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Role", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.984Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.985Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.985Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
2022-04-05T14:09:39.986Z	INFO	controllers.ClusterPolicy	Getting assets from: 	{"path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.986Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "Role", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.987Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "ConfigMap", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.988Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "SecurityContextConstraints", "in path:": "/opt/gpu-operator/state-mig-manager"}
2022-04-05T14:09:39.988Z	INFO	controllers.ClusterPolicy	DEBUG: Looking for 	{"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-mig-manager"}
time="2022-04-05T14:09:39Z" level=info msg="Checking GPU state labels on the nodeNodeNameip-10-111-2-106.ec2.internal" source="state_manager.go:236"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.drivervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.gpu-feature-discoveryvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.container-toolkitvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.device-pluginvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.dcgm-exportervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.operator-validatorvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg="Checking GPU state labels on the nodeNodeNameip-10-111-35-191.ec2.internal" source="state_manager.go:236"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.dcgm-exportervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.operator-validatorvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.drivervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.gpu-feature-discoveryvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.container-toolkitvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.device-pluginvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg="Checking GPU state labels on the nodeNodeNameip-10-111-6-113.ec2.internal" source="state_manager.go:236"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.drivervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.gpu-feature-discoveryvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.container-toolkitvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.device-pluginvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.dcgm-exportervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.operator-validatorvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg="Checking GPU state labels on the nodeNodeNameip-10-111-5-155.ec2.internal" source="state_manager.go:236"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.drivervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.gpu-feature-discoveryvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.container-toolkitvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.device-pluginvaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.dcgm-exportervaluetrue" source="state_manager.go:137"
time="2022-04-05T14:09:39Z" level=info msg=" - Labelnvidia.com/gpu.deploy.operator-validatorvaluetrue" source="state_manager.go:137"
2022-04-05T14:09:40.020Z	INFO	controllers.ClusterPolicy	Found Resource	{"Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.029Z	INFO	controllers.ClusterPolicy	Found Resource	{"RuntimeClass": "nvidia"}
2022-04-05T14:09:40.041Z	INFO	controllers.ClusterPolicy	Found Resource	{"ServiceAccount": "nvidia-driver", "Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.050Z	INFO	controllers.ClusterPolicy	Found Resource	{"Role": "nvidia-driver", "Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.059Z	INFO	controllers.ClusterPolicy	Found Resource	{"ClusterRole": "nvidia-driver", "Namespace": ""}
2022-04-05T14:09:40.069Z	INFO	controllers.ClusterPolicy	Found Resource	{"RoleBinding": "nvidia-driver", "Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.079Z	INFO	controllers.ClusterPolicy	Found Resource	{"ClusterRoleBinding": "nvidia-driver", "Namespace": ""}
2022-04-05T14:09:40.088Z	INFO	controllers.ClusterPolicy	Found Resource	{"ConfigMap": "nvidia-driver", "Namespace": "gpu-operator-resources"}
2022-04-05T14:09:40.099Z	INFO	controllers.ClusterPolicy	Found Resource	{"SecurityContextConstraints": "nvidia-driver", "Namespace": "default"}
2022-04-05T14:09:40.099Z	INFO	controllers.ClusterPolicy	4.18.0-193.47.1.el8_2.x86_64	{"Request.Namespace": "default", "Request.Name": "Node"}
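
The logs simply stop with no error before the restart. One way to confirm that the container was OOM-killed (rather than exiting on its own) is to inspect its last termination state:

$ oc get pod gpu-operator-566644fc46-2znxj -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

This should report reason OOMKilled with exit code 137, matching the pod status above.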

smithbk avatar Apr 05 '22 14:04 smithbk

I don't know what could be going wrong here. We installed the GPU Operator v1.7.1 together from OperatorHub, and things were smooth after we solved https://github.com/NVIDIA/gpu-operator/issues/330,

but I don't know why the operator is crashing hard and silently like that :/

for reference, here is a valid log of the GPU Operator v1.7.1 on OCP 4.6: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-rh-ecosystem-edge-ci-artifacts-master-4.6-nvidia-gpu-operator-e2e-1-7-0/1511116673678053376/artifacts/nvidia-gpu-operator-e2e-1-7-0/nightly/artifacts/012__gpu_operator__capture_deployment_state/gpu_operator.log

kpouget avatar Apr 05 '22 15:04 kpouget

@kpouget Kevin, do you know who might be able to help with this? Thanks

smithbk avatar Apr 05 '22 20:04 smithbk

@smithbk can you describe the operator Pod?

We didn't see this when we looked at it together:

gpu-operator-566644fc46-2znxj   0/1     OOMKilled   5          6m27s

but this is likely the reason why the operator is crashing without any error message.

@shivamerla do you remember a memory issue on 1.7.1, with 4 GPU nodes?

I see this in the Pod spec:

                resources:
                  limits:
                    cpu: 500m
                    memory: 250Mi
                  requests:
                    cpu: 200m
                    memory: 100Mi
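
To see how close the operator pod actually gets to that 250Mi limit before it is killed, the metrics API can be queried (assuming metrics are available on the cluster):

$ oc adm top pod gpu-operator-566644fc46-2znxj -n openshift-operators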

kpouget avatar Apr 06 '22 06:04 kpouget

@kpouget @shivamerla Here is the pod description

$ oc describe pod gpu-operator-566644fc46-2znxj
Name:                 gpu-operator-566644fc46-2znxj
Namespace:            openshift-operators
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-111-61-177.ec2.internal/10.111.61.177
Start Time:           Tue, 05 Apr 2022 10:08:24 -0400
Labels:               app.kubernetes.io/component=gpu-operator
                      name=gpu-operator
                      pod-template-hash=566644fc46
Annotations:          alm-examples:
                        [
                          {
                            "apiVersion": "nvidia.com/v1",
                            "kind": "ClusterPolicy",
                            "metadata": {
                              "name": "gpu-cluster-policy"
                            },
                            "spec": {
                              "dcgmExporter": {
                                "affinity": {},
                                "image": "dcgm-exporter",
                                "imagePullSecrets": [],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.dcgm-exporter": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia/k8s",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:8af02463a8b60b21202d0bf69bc1ee0bb12f684fa367f903d138df6cacc2d0ac"
                              },
                              "devicePlugin": {
                                "affinity": {},
                                "image": "k8s-device-plugin",
                                "imagePullSecrets": [],
                                "args": [],
                                "env": [
                                  {
                                    "name": "PASS_DEVICE_SPECS",
                                    "value": "true"
                                  },
                                  {
                                    "name": "FAIL_ON_INIT_ERROR",
                                    "value": "true"
                                  },
                                  {
                                    "name": "DEVICE_LIST_STRATEGY",
                                    "value": "envvar"
                                  },
                                  {
                                    "name": "DEVICE_ID_STRATEGY",
                                    "value": "uuid"
                                  },
                                  {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "all"
                                  },
                                  {
                                    "name": "NVIDIA_DRIVER_CAPABILITIES",
                                    "value": "all"
                                  }
                                ],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.device-plugin": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:85def0197f388e5e336b1ab0dbec350816c40108a58af946baa1315f4c96ee05"
                              },
                              "driver": {
                                "enabled": true,
                                "affinity": {},
                                "image": "driver",
                                "imagePullSecrets": [],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.driver": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "repoConfig": {
                                  "configMapName": "",
                                  "destinationDir": ""
                                },
                                "licensingConfig": {
                                  "configMapName": ""
                                },
                                "version": "sha256:09ba3eca64a80fab010a9fcd647a2675260272a8c3eb515dfed6dc38a2d31ead"
                              },
                              "gfd": {
                                "affinity": {},
                                "image": "gpu-feature-discovery",
                                "imagePullSecrets": [],
                                "env": [
                                  {
                                    "name": "GFD_SLEEP_INTERVAL",
                                    "value": "60s"
                                  },
                                  {
                                    "name": "FAIL_ON_INIT_ERROR",
                                    "value": "true"
                                  }
                                ],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.gpu-feature-discovery": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f"
                              },
                              "migManager": {
                                "enabled": true,
                                "affinity": {},
                                "image": "k8s-mig-manager",
                                "imagePullSecrets": [],
                                "env": [
                                  {
                                    "name": "WITH_REBOOT",
                                    "value": "false"
                                  }
                                ],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.mig-manager": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia/cloud-native",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:495ed3b42e0541590c537ab1b33bda772aad530d3ef6a4f9384d3741a59e2bf8"
                              },
                              "operator": {
                                "defaultRuntime": "crio",
                                "deployGFD": true,
                                "initContainer": {
                                  "image": "cuda",
                                  "repository": "nvcr.io/nvidia",
                                  "version": "sha256:15674e5c45c97994bc92387bad03a0d52d7c1e983709c471c4fecc8e806dbdce",
                                  "imagePullSecrets": []
                                }
                              },
                              "mig": {
                                "strategy": "single"
                              },
                              "toolkit": {
                                "enabled": true,
                                "affinity": {},
                                "image": "container-toolkit",
                                "imagePullSecrets": [],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.container-toolkit": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia/k8s",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:ffa284f1f359d70f0e1d6d8e7752d7c92ef7445b0d74965a8682775de37febf8"
                              },
                              "validator": {
                                "affinity": {},
                                "image": "gpu-operator-validator",
                                "imagePullSecrets": [],
                                "nodeSelector": {
                                  "nvidia.com/gpu.deploy.operator-validator": "true"
                                },
                                "podSecurityContext": {},
                                "repository": "nvcr.io/nvidia/cloud-native",
                                "resources": {},
                                "securityContext": {},
                                "tolerations": [],
                                "priorityClassName": "system-node-critical",
                                "version": "sha256:aa1f7bd526ae132c46f3ebe6ecfabe675889e240776ccc2155e31e0c48cc659e",
                                "env": [
                                  {
                                    "name": "WITH_WORKLOAD",
                                    "value": "true"
                                  }
                                ]
                              }
                            }
                          }
                        ]
                      capabilities: Basic Install
                      categories: AI/Machine Learning, OpenShift Optional
                      certified: true
                      cni.projectcalico.org/containerID: aa562b5de68796f144d43e698477d85a889705ce4db6df7dff95e20f82194464
                      cni.projectcalico.org/podIP: 172.27.15.52/32
                      cni.projectcalico.org/podIPs: 172.27.15.52/32
                      containerImage: nvcr.io/nvidia/gpu-operator:v1.7.1
                      createdAt: Wed Jun 16 06:51:51 PDT 2021
                      description: Automate the management and monitoring of NVIDIA GPUs.
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "172.27.15.52"
                            ],
                            "mac": "86:f1:9f:e8:4f:fe",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "172.27.15.52"
                            ],
                            "mac": "86:f1:9f:e8:4f:fe",
                            "default": true,
                            "dns": {}
                        }]
                      olm.operatorGroup: global-operators
                      olm.operatorNamespace: openshift-operators
                      olm.targetNamespaces: 
                      openshift.io/scc: hostmount-anyuid
                      operatorframework.io/properties:
                        {"properties":[{"type":"olm.gvk","value":{"group":"nvidia.com","kind":"ClusterPolicy","version":"v1"}},{"type":"olm.package","value":{"pac...
                      operators.openshift.io/infrastructure-features: ["Disconnected"]
                      operators.operatorframework.io/builder: operator-sdk-v1.4.0
                      operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
                      provider: NVIDIA
                      repository: http://github.com/NVIDIA/gpu-operator
                      support: NVIDIA
Status:               Running
IP:                   172.27.15.52
IPs:
  IP:           172.27.15.52
Controlled By:  ReplicaSet/gpu-operator-566644fc46
Containers:
  gpu-operator:
    Container ID:  cri-o://8f8e24b1c06329b3a19a218408c2ed4787c2d19b7babde6d2d5aceace96324b3
    Image:         nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c
    Port:          <none>
    Host Port:     <none>
    Command:
      gpu-operator
    Args:
      --leader-elect
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 06 Apr 2022 08:06:16 -0400
      Finished:     Wed, 06 Apr 2022 08:06:48 -0400
    Ready:          False
    Restart Count:  239
    Limits:
      cpu:     500m
      memory:  250Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      HTTP_PROXY:   http://proxy-app.discoverfinancial.com:8080
      HTTPS_PROXY:  http://proxy-app.discoverfinancial.com:8080
      NO_PROXY:     .artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.aws.discoverfinancial.com,.cluster.local,.discoverfinancial.com,.ec2.internal,.na.discoverfinancial.com,.ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.ocp.aws.discoverfinancial.com,.ocpdev.us-east-1.ac.discoverfinancial.com,.prdops3-app.ocp.aws.discoverfinancial.com,.rw.discoverfinancial.com,.svc,10.0.0.0/8,10.111.0.0/16,127.0.0.1,169.254.169.254,172.23.0.0/16,172.24.0.0/14,api-int.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,artifactory.prdops3-app.ocp.aws.discoverfinancial.com,aws.discoverfinancial.com,discoverfinancial.com,ec2.internal,etcd-0.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,etcd-1.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,etcd-2.aws-useast1-apps-lab-1.ocpdev.us-east-1.ac.discoverfinancial.com,localhost,na.discoverfinancial.com,ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,ocp.aws.discoverfinancial.com,ocpdev.us-east-1.ac.discoverfinancial.com,prdops3-app.ocp.aws.discoverfinancial.com,rw.discoverfinancial.com
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from gpu-operator-token-2w6p4 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  gpu-operator-token-2w6p4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  gpu-operator-token-2w6p4
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                     From     Message
  ----     ------          ----                    ----     -------
  Normal   AddedInterface  132m                    multus   Add eth0 [172.27.15.27/32]
  Warning  Unhealthy       120m                    kubelet  Readiness probe failed: Get "http://172.27.15.27:8081/readyz": dial tcp 172.27.15.27:8081: connect: connection refused
  Normal   Pulled          70m (x227 over 21h)     kubelet  Container image "nvcr.io/nvidia/gpu-operator@sha256:3a812cf113f416baca9262fa8423f36141f35696eb6e7a51a7abb40f5ccd5f8c" already present on machine
  Normal   AddedInterface  69m                     multus   Add eth0 [172.27.15.52/32]
  Warning  Unhealthy       30m                     kubelet  Liveness probe failed: Get "http://172.27.15.52:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff         5m18s (x5622 over 21h)  kubelet  Back-off restarting failed container

smithbk avatar Apr 06 '22 12:04 smithbk

still this,

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled

but I expected to see more things in the Event logs ... :/

can you check that your node ip-10-111-61-177.ec2.internal/10.111.61.177 isn't running out of memory?
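
One way to check from the node itself is via a debug pod (a sketch; it assumes cluster-admin access and the usual RHCOS layout):

$ oc debug node/ip-10-111-61-177.ec2.internal
sh-4.4# chroot /host
sh-4.4# free -m
sh-4.4# dmesg | grep -iE 'out of memory|oom'

A cgroup OOM kill of the operator container would also show up in dmesg, even if the node as a whole still has free memory.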

kpouget avatar Apr 06 '22 12:04 kpouget

@kpouget Looks OK to me. If there is some other way of checking, let me know.

$ oc describe node ip-10-111-61-177.ec2.internal
Name:               ip-10-111-61-177.ec2.internal
Roles:              infra,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5a.2xlarge
                    beta.kubernetes.io/os=linux
                    contact=OCPEngineers
                    cost_center=458690
                    enterprise.discover.com/cluster-id=aws-useast1-apps-lab-r2jkd
                    enterprise.discover.com/cluster-name=aws-useast1-apps-lab-1
                    enterprise.discover.com/cost_center=458690
                    enterprise.discover.com/data-classification=na
                    enterprise.discover.com/environment=lab
                    enterprise.discover.com/freedom=false
                    enterprise.discover.com/gdpr=false
                    enterprise.discover.com/openshift=true
                    enterprise.discover.com/openshift-role=worker
                    enterprise.discover.com/pci=false
                    enterprise.discover.com/product=common
                    enterprise.discover.com/public=false
                    enterprise.discover.com/support-assignment-group=OCPEngineering
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1d
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/custom-rdma.available=true
                    feature.node.kubernetes.io/kernel-selinux.enabled=true
                    feature.node.kubernetes.io/kernel-version.full=4.18.0-193.47.1.el8_2.x86_64
                    feature.node.kubernetes.io/kernel-version.major=4
                    feature.node.kubernetes.io/kernel-version.minor=18
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=rhcos
                    feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.6
                    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.2
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.6
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=6
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-111-61-177
                    kubernetes.io/os=linux
                    machine.openshift.io/cluster-api-cluster=aws-useast1-apps-lab-1
                    machine.openshift.io/cluster-api-cluster-name=aws-useast1-apps-lab-1
                    machine.openshift.io/cluster-api-machine-role=worker
                    machine.openshift.io/cluster-api-machineset=infra-1d
                    machine.openshift.io/cluster-api-machineset-group=infra
                    machine.openshift.io/cluster-api-machineset-ha=1d
                    node-role.kubernetes.io/infra=
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5a.2xlarge
                    node.openshift.io/os_id=rhcos
                    route-reflector=true
                    topology.ebs.csi.aws.com/zone=us-east-1d
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1d
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0fc5da74c55fd897c"}
                    machine.openshift.io/machine: openshift-machine-api/infra-1d-rvc9x
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-3a01af8a0304107341810791e3b3ad99
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-3a01af8a0304107341810791e3b3ad99
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.SHA,cpu-cpuid.SSE4A,cpu-hardware_multithreading,custom...
                    nfd.node.kubernetes.io/worker.version: 1.15
                    projectcalico.org/IPv4Address: 10.111.61.177/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 172.27.15.0
                    projectcalico.org/RouteReflectorClusterID: 1.0.0.1
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 24 Jan 2022 17:04:22 -0500
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-111-61-177.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 06 Apr 2022 11:57:41 -0400
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 17 Feb 2022 15:17:07 -0500   Thu, 17 Feb 2022 15:17:07 -0500   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 06 Apr 2022 11:54:51 -0400   Mon, 24 Jan 2022 17:04:22 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 06 Apr 2022 11:54:51 -0400   Mon, 24 Jan 2022 17:04:22 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 06 Apr 2022 11:54:51 -0400   Mon, 24 Jan 2022 17:04:22 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 06 Apr 2022 11:54:51 -0400   Mon, 24 Jan 2022 17:05:32 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.111.61.177
  Hostname:     ip-10-111-61-177.ec2.internal
  InternalDNS:  ip-10-111-61-177.ec2.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         8
  ephemeral-storage:           125277164Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      32288272Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         7500m
  ephemeral-storage:           120795883220
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      31034896Ki
  pods:                        250
System Info:
  Machine ID:                             ec29f9293380ea1eceab3523cbbd2b2a
  System UUID:                            ec29f929-3380-ea1e-ceab-3523cbbd2b2a
  Boot ID:                                89e3a344-ba71-4882-8b39-97738890d719
  Kernel Version:                         4.18.0-193.47.1.el8_2.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 46.82.202104170019-0 (Ootpa)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.19.1-11.rhaos4.6.git050df4c.el8
  Kubelet Version:                        v1.19.0+a5a0987
  Kube-Proxy Version:                     v1.19.0+a5a0987
ProviderID:                               aws:///us-east-1d/i-0fc5da74c55fd897c
Non-terminated Pods:                      (33 in total)
  Namespace                               Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                                          ------------  ----------  ---------------  -------------  ---
  calico-system                           calico-node-wfr7j                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         47d
  eng-attempt48                           eventbus-default-stan-0                                       200m (2%)     400m (5%)   262144k (0%)     2Gi (6%)       35h
  gremlin                                 gremlin-pgxb4                                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  instana-agent                           instana-agent-x4snr                                           600m (8%)     2 (26%)     2112Mi (6%)      2Gi (6%)       20m
  kube-system                             istio-cni-node-vskdr                                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-2z9sj                                 30m (0%)      0 (0%)      150Mi (0%)       0 (0%)         71d
  openshift-cluster-node-tuning-operator  tuned-49mnk                                                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         71d
  openshift-compliance                    dfs-ocp4-cis-node-worker-ip-10-111-61-177.ec2.internal-pod    20m (0%)      200m (2%)   70Mi (0%)        600Mi (1%)     19d
  openshift-compliance                    ocp4-cis-node-worker-ip-10-111-61-177.ec2.internal-pod        20m (0%)      200m (2%)   70Mi (0%)        600Mi (1%)     19d
  openshift-dns                           dns-default-b4q5z                                             65m (0%)      0 (0%)      110Mi (0%)       512Mi (1%)     19d
  openshift-image-registry                node-ca-rfx9v                                                 10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         71d
  openshift-ingress                       router-default-55c779749d-5g9l5                               200m (2%)     0 (0%)      512Mi (1%)       0 (0%)         71d
  openshift-kube-proxy                    openshift-kube-proxy-8lr6h                                    100m (1%)     0 (0%)      200Mi (0%)       0 (0%)         19d
  openshift-machine-config-operator       machine-config-daemon-7kmlj                                   40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         71d
  openshift-marketplace                   opencloud-operators-p8vss                                     10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         37h
  openshift-monitoring                    node-exporter-ttdb5                                           9m (0%)       0 (0%)      210Mi (0%)       0 (0%)         71d
  openshift-monitoring                    prometheus-adapter-6b47cfbf98-rvgnt                           1m (0%)       0 (0%)      25Mi (0%)        0 (0%)         2d16h
  openshift-monitoring                    prometheus-operator-68d689dccc-t6rzm                          6m (0%)       0 (0%)      100Mi (0%)       0 (0%)         3d16h
  openshift-multus                        multus-594h4                                                  10m (0%)      0 (0%)      150Mi (0%)       0 (0%)         19d
  openshift-multus                        network-metrics-daemon-5ngdr                                  20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         19d
  openshift-nfd                           nfd-worker-8r252                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  openshift-node                          splunk-rjhk7                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  openshift-operators                     gpu-operator-566644fc46-2znxj                                 200m (2%)     500m (6%)   100Mi (0%)       250Mi (0%)     25h
  openshift-operators                     nfd-worker-qcf7l                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         39d
  postgresql-operator                     postgresql-operator-79f8644dd9-krcfb                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         45h
  sample-project                          mongodb-1-2n98n                                               0 (0%)        0 (0%)      512Mi (1%)       512Mi (1%)     55d
  skunkworks                              backstage-67fc9f9b45-cx4x8                                    350m (4%)     700m (9%)   576Mi (1%)       1152Mi (3%)    42h
  sysdig-agent                            sysdig-agent-fw94l                                            1 (13%)       2 (26%)     512Mi (1%)       1536Mi (5%)    37s
  sysdig-agent                            sysdig-image-analyzer-8xwvw                                   250m (3%)     500m (6%)   512Mi (1%)       1536Mi (5%)    38s
  sysdig-agent                            sysdig-image-analyzer-xpt4q                                   250m (3%)     500m (6%)   512Mi (1%)       1536Mi (5%)    14h
  tigera-compliance                       compliance-benchmarker-br5xl                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         47d
  tigera-fluentd                          fluentd-node-qzpxf                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         47d
  vault-secrets-operator                  vault-secrets-operator-controller-7598f4bd5f-4cfdc            2 (26%)       2 (26%)     2Gi (6%)         2Gi (6%)       26s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests          Limits
  --------                    --------          ------
  cpu                         5401m (72%)       9 (120%)
  memory                      9501147136 (29%)  14378Mi (47%)
  ephemeral-storage           0 (0%)            0 (0%)
  hugepages-1Gi               0 (0%)            0 (0%)
  hugepages-2Mi               0 (0%)            0 (0%)
  attachable-volumes-aws-ebs  0                 0
Events:                       <none>
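
Note that oc describe node reports requests and limits, not live usage; actual consumption can be checked with the metrics API (if available on the cluster):

$ oc adm top node ip-10-111-61-177.ec2.internal

Either way, the OOMKilled status comes from the container's own 250Mi memory limit, so it can be triggered even when the node itself has plenty of free memory.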

smithbk avatar Apr 06 '22 16:04 smithbk

@kpouget Any other ideas of what to check, or someone else who would know? Thanks

smithbk avatar Apr 06 '22 20:04 smithbk

@smithbk @kpouget Yes, I do remember this happening, where the GPU Operator memory usage momentarily spikes on OCP. We have yet to identify the cause. We can edit the CSV/Operator Deployment spec to allow the following limits:

                resources:
                  limits:
                    cpu: 500m
                    memory: 1Gi
                  requests:
                    cpu: 200m
                    memory: 200Mi
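
A sketch of applying that change through the CSV so that OLM does not revert a direct edit of the Deployment (the CSV name and deployment index below are assumptions; check them with the first two commands):

$ oc get csv -n openshift-operators | grep gpu-operator
$ oc get csv gpu-operator-certified.v1.7.1 -n openshift-operators -o jsonpath='{.spec.install.spec.deployments[0].name}'
$ oc patch csv gpu-operator-certified.v1.7.1 -n openshift-operators --type=json -p '[
    {"op": "replace",
     "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory",
     "value": "1Gi"},
    {"op": "replace",
     "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory",
     "value": "200Mi"}
  ]'

OLM should then roll out the gpu-operator Deployment with the new values (the change persists until the next operator upgrade replaces the CSV).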

shivamerla avatar Apr 06 '22 21:04 shivamerla

@kpouget The pod is running now, but the ClusterPolicy status is not progressing. Here is what I'm seeing now.

$ oc get pod -n openshift-operators | grep gpu-operator
gpu-operator-889b67578-r57p5                   1/1     Running       0          18m

Note the "ClusterPolicy step wasn't ready" messages below.

$ oc logs gpu-operator-889b67578-r57p5 -n openshift-operators --tail 50
2022-04-07T01:11:23.642Z	INFO	controllers.ClusterPolicy	Found Resource	{"ClusterRoleBinding": "nvidia-operator-validator", "Namespace": ""}
2022-04-07T01:11:23.654Z	INFO	controllers.ClusterPolicy	Found Resource	{"SecurityContextConstraints": "nvidia-operator-validator", "Namespace": "default"}
2022-04-07T01:11:23.664Z	INFO	controllers.ClusterPolicy	Found Resource	{"DaemonSet": "nvidia-operator-validator", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.664Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"LabelSelector": "app=nvidia-operator-validator"}
2022-04-07T01:11:23.664Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.664Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberUnavailable": 4}
2022-04-07T01:11:23.664Z	INFO	controllers.ClusterPolicy	ClusterPolicy step wasn't ready	{"State:": "notReady"}
2022-04-07T01:11:23.672Z	INFO	controllers.ClusterPolicy	Found Resource	{"ServiceAccount": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.680Z	INFO	controllers.ClusterPolicy	Found Resource	{"Role": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.689Z	INFO	controllers.ClusterPolicy	Found Resource	{"RoleBinding": "nvidia-device-plugin", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.703Z	INFO	controllers.ClusterPolicy	Found Resource	{"DaemonSet": "nvidia-device-plugin-daemonset", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.703Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"LabelSelector": "app=nvidia-device-plugin-daemonset"}
2022-04-07T01:11:23.703Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.703Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberUnavailable": 4}
2022-04-07T01:11:23.703Z	INFO	controllers.ClusterPolicy	ClusterPolicy step wasn't ready	{"State:": "notReady"}
2022-04-07T01:11:23.712Z	INFO	controllers.ClusterPolicy	Found Resource	{"ServiceAccount": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.724Z	INFO	controllers.ClusterPolicy	Found Resource	{"Role": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.737Z	INFO	controllers.ClusterPolicy	Found Resource	{"RoleBinding": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.744Z	INFO	controllers.ClusterPolicy	Found Resource	{"Role": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.756Z	INFO	controllers.ClusterPolicy	Found Resource	{"RoleBinding": "prometheus-k8s", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.775Z	INFO	controllers.ClusterPolicy	Found Resource	{"Service": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.784Z	INFO	controllers.ClusterPolicy	Found Resource	{"ServiceMonitor": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.793Z	INFO	controllers.ClusterPolicy	Found Resource	{"ConfigMap": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.804Z	INFO	controllers.ClusterPolicy	Found Resource	{"SecurityContextConstraints": "nvidia-dcgm-exporter", "Namespace": "default"}
2022-04-07T01:11:23.804Z	INFO	controllers.ClusterPolicy	4.18.0-193.47.1.el8_2.x86_64	{"Request.Namespace": "default", "Request.Name": "Node"}
2022-04-07T01:11:23.814Z	INFO	controllers.ClusterPolicy	Found Resource	{"DaemonSet": "nvidia-dcgm-exporter", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.814Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"LabelSelector": "app=nvidia-dcgm-exporter"}
2022-04-07T01:11:23.814Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.814Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberUnavailable": 4}
2022-04-07T01:11:23.814Z	INFO	controllers.ClusterPolicy	ClusterPolicy step wasn't ready	{"State:": "notReady"}
2022-04-07T01:11:23.821Z	INFO	controllers.ClusterPolicy	Found Resource	{"ServiceAccount": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.828Z	INFO	controllers.ClusterPolicy	Found Resource	{"Role": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.839Z	INFO	controllers.ClusterPolicy	Found Resource	{"RoleBinding": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.850Z	INFO	controllers.ClusterPolicy	Found Resource	{"SecurityContextConstraints": "nvidia-gpu-feature-discovery", "Namespace": "default"}
2022-04-07T01:11:23.858Z	INFO	controllers.ClusterPolicy	Found Resource	{"DaemonSet": "gpu-feature-discovery", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.858Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"LabelSelector": "app=gpu-feature-discovery"}
2022-04-07T01:11:23.858Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.858Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberUnavailable": 4}
2022-04-07T01:11:23.858Z	INFO	controllers.ClusterPolicy	ClusterPolicy step wasn't ready	{"State:": "notReady"}
2022-04-07T01:11:23.866Z	INFO	controllers.ClusterPolicy	Found Resource	{"ServiceAccount": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.873Z	INFO	controllers.ClusterPolicy	Found Resource	{"Role": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.881Z	INFO	controllers.ClusterPolicy	Found Resource	{"ClusterRole": "nvidia-mig-manager", "Namespace": ""}
2022-04-07T01:11:23.891Z	INFO	controllers.ClusterPolicy	Found Resource	{"RoleBinding": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.909Z	INFO	controllers.ClusterPolicy	Found Resource	{"ClusterRoleBinding": "nvidia-mig-manager", "Namespace": ""}
2022-04-07T01:11:23.918Z	INFO	controllers.ClusterPolicy	Found Resource	{"ConfigMap": "mig-parted-config", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.942Z	INFO	controllers.ClusterPolicy	Found Resource	{"SecurityContextConstraints": "nvidia-driver", "Namespace": "default"}
2022-04-07T01:11:23.952Z	INFO	controllers.ClusterPolicy	Found Resource	{"DaemonSet": "nvidia-mig-manager", "Namespace": "gpu-operator-resources"}
2022-04-07T01:11:23.952Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"LabelSelector": "app=nvidia-mig-manager"}
2022-04-07T01:11:23.952Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberOfDaemonSets": 1}
2022-04-07T01:11:23.952Z	INFO	controllers.ClusterPolicy	DEBUG: DaemonSet	{"NumberUnavailable": 0}
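For completeness, the operator also records an overall state on the ClusterPolicy itself, which is a quicker check than tailing the logs. A minimal sketch, assuming the CRD exposes a status.state field as recent operator versions do:

# List the ClusterPolicy resources and print name plus reported state.
oc get clusterpolicy
oc get clusterpolicy -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'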

The pods in the gpu-operator-resources namespace are failing:

$ oc get pod -n gpu-operator-resources
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-2k9fw                0/1     Init:0/1           0          15m
gpu-feature-discovery-7dwvv                0/1     Init:0/1           0          15m
gpu-feature-discovery-tgl5k                0/1     Init:0/1           0          15m
gpu-feature-discovery-vgwlp                0/1     Init:0/1           0          15m
nvidia-container-toolkit-daemonset-c5xck   0/1     Init:0/1           0          15m
nvidia-container-toolkit-daemonset-cc59r   0/1     Init:0/1           0          15m
nvidia-container-toolkit-daemonset-fppnr   0/1     Init:0/1           0          15m
nvidia-container-toolkit-daemonset-jc64m   0/1     Init:0/1           0          15m
nvidia-dcgm-exporter-gb7c4                 0/1     Init:0/2           0          15m
nvidia-dcgm-exporter-hm66s                 0/1     Init:0/2           0          15m
nvidia-dcgm-exporter-mqzzk                 0/1     Init:0/2           0          15m
nvidia-dcgm-exporter-msz6r                 0/1     Init:0/2           0          15m
nvidia-device-plugin-daemonset-cj6bs       0/1     Init:0/1           0          15m
nvidia-device-plugin-daemonset-kn6x6       0/1     Init:0/1           0          15m
nvidia-device-plugin-daemonset-lktnb       0/1     Init:0/1           0          15m
nvidia-device-plugin-daemonset-lv6hx       0/1     Init:0/1           0          15m
nvidia-driver-daemonset-f8g6d              0/1     CrashLoopBackOff   7          15m
nvidia-driver-daemonset-hjvgl              0/1     CrashLoopBackOff   7          15m
nvidia-driver-daemonset-vb85p              0/1     CrashLoopBackOff   7          15m
nvidia-driver-daemonset-xj4tk              0/1     CrashLoopBackOff   7          15m
nvidia-operator-validator-pzp8s            0/1     Init:0/4           0          15m
nvidia-operator-validator-rd6cq            0/1     Init:0/4           0          15m
nvidia-operator-validator-t7n5z            0/1     Init:0/4           0          15m
nvidia-operator-validator-wzgp9            0/1     Init:0/4           0          15m
$ oc logs nvidia-driver-daemonset-f8g6d -n gpu-operator-resources
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=460.73.01
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ RESOLVE_OCP_VERSION=true
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-193.47.1.el8_2.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
+ echo 'Resolving RHEL version...'
Resolving RHEL version...
+ local version=
++ cat /host-etc/os-release
++ sed -e 's/^"//' -e 's/"$//'
++ awk -F= '{print $2}'
++ grep '^ID='
+ local id=rhcos
+ '[' rhcos = rhcos ']'
++ grep RHEL_VERSION
++ awk -F= '{print $2}'
++ sed -e 's/^"//' -e 's/"$//'
++ cat /host-etc/os-release
+ version=8.2
+ '[' -z 8.2 ']'
+ RHEL_VERSION=8.2
+ echo 'Proceeding with RHEL version 8.2'
Proceeding with RHEL version 8.2
+ return 0
+ _resolve_ocp_version
+ '[' true = true ']'
++ jq '.items[].status.desired.version'
++ sed -e 's/^"//' -e 's/"$//'
++ awk -F. '{printf("%d.%d\n", $1, $2)}'
++ kubectl get clusterversion -o json
Unable to connect to the server: Proxy Authentication Required
+ local version=
Resolving OpenShift version...
+ echo 'Resolving OpenShift version...'
+ '[' -z '' ']'
+ echo 'Could not resolve OpenShift version'
Could not resolve OpenShift version
+ return 1
+ exit 1

It seems that the root cause of this problem is the following, right?

++ kubectl get clusterversion -o json
Unable to connect to the server: Proxy Authentication Required

But this cluster is configured with a proxy.

$ oc get proxy
NAME      AGE
cluster   455d
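The listing above only confirms that a cluster-wide Proxy object exists; the actual endpoints and exclusions live in its spec. A minimal sketch for dumping them, assuming the standard Proxy fields:

# Show the configured proxy endpoints and the noProxy exclusion list.
oc get proxy cluster -o jsonpath='{.spec.httpProxy}{"\n"}{.spec.httpsProxy}{"\n"}{.spec.noProxy}{"\n"}'
# The effective (merged) exclusion list is reported under status.
oc get proxy cluster -o jsonpath='{.status.noProxy}{"\n"}'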

Any ideas? Should I delete the cluster policy, delete the gpu-operator-resources namespace, and then recreate the cluster policy? I'm not sure whether creating the cluster policy recreates the gpu-operator-resources namespace.

smithbk avatar Apr 07 '22 01:04 smithbk

@kpouget It appears that kubectl does not recognize CIDR ranges in the no_proxy environment variable, so it was sending the request through the proxy anyway.
Perhaps adding a test case that runs behind a proxy would be worthwhile. In any case, I added the appropriate IP to no_proxy (a sketch of that workaround is at the end of this comment, after the log). The driver install now gets further, but fails as follows:

========== NVIDIA Software Installer ==========

+ echo -e 'Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-193.47.1.el8_2.x86_64\n'
Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-193.47.1.el8_2.x86_64

+ exec
+ flock -n 3
+ echo 1946547
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _kernel_requires_package
+ local proc_mount_arg=
+ echo 'Checking NVIDIA driver packages...'
Checking NVIDIA driver packages...
+ [[ ! -d /usr/src/nvidia-460.73.01/kernel ]]
+ cd /usr/src/nvidia-460.73.01/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-193.47.1.el8_2.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
Updating the package cache...
+ yum -q makecache
Error: Failed to download metadata for repo 'cuda': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
+ _shutdown
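For reference, the no_proxy workaround mentioned at the top of this comment amounted to excluding the API server address from proxying. A rough sketch, assuming the cluster-wide Proxy CR is where the value should be added (spec.noProxy is a comma-separated string; the placeholders below are not real values):

# Note the existing exclusions first, then append the API server address.
oc get proxy cluster -o jsonpath='{.spec.noProxy}{"\n"}'
oc patch proxy/cluster --type=merge -p '{"spec":{"noProxy":"<existing-entries>,<apiserver-ip>"}}'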

smithbk avatar Apr 07 '22 14:04 smithbk

@smithbk It looks like access to the CUDA repository is being blocked by the proxy. Can you check whether developer.download.nvidia.com is reachable?
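A quick connectivity sketch, assuming HTTPS_PROXY is exported with the same value the driver container sees (the URL is just the repository host, not a specific file):

# Expect an HTTP status line back rather than a proxy or connection error.
curl -x "$HTTPS_PROXY" -sI https://developer.download.nvidia.com/ | head -n 1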

shivamerla avatar Apr 13 '22 14:04 shivamerla

Also, to test whether the driver container can pull from all of the required repositories, you can run:

cat <<EOF > test-ca-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: trusted-ca
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
EOF
cat <<EOF > test-entitlements-proxy.yaml
apiVersion: v1
kind: Pod
metadata:
  name: entitlements-proxy
spec:
  containers:
    - name: cluster-entitled-build
      image: registry.access.redhat.com/ubi8:latest
      command: [ "/bin/sh", "-c", "dnf -d 5 search kernel-devel --showduplicates" ]
      env:
        - name: HTTP_PROXY
          value: ${HTTP_PROXY}
        - name: HTTPS_PROXY
          value: ${HTTPS_PROXY}
        - name: NO_PROXY
          value: ${NO_PROXY}
      volumeMounts:
        - name: trusted-ca
          mountPath: "/etc/pki/ca-trust/extracted/pem/"
          readOnly: true
  volumes:
    - name: trusted-ca
      configMap:
        name: trusted-ca
        items:
          - key: ca-bundle.crt
            path: tls-ca-bundle.pem
  restartPolicy: Never
EOF
oc apply -f test-ca-configmap.yaml  -f test-entitlements-proxy.yaml

You can get the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY values from the cluster-wide proxy with oc describe proxy cluster.
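For example, a minimal sketch for exporting them before running the heredocs above (field names per the cluster-wide Proxy spec):

export HTTP_PROXY=$(oc get proxy cluster -o jsonpath='{.spec.httpProxy}')
export HTTPS_PROXY=$(oc get proxy cluster -o jsonpath='{.spec.httpsProxy}')
export NO_PROXY=$(oc get proxy cluster -o jsonpath='{.spec.noProxy}')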

shivamerla avatar Apr 13 '22 14:04 shivamerla

@smithbk @kpouget Yes, I do remember this happening: the GPU operator's memory usage momentarily spikes on OCP. We have yet to identify the cause. We can edit the CSV/Operator Deployment spec to allow the following limits:

                resources:
                  limits:
                    cpu: 500m
                    memory: 1Gi
                  requests:
                    cpu: 200m
                    memory: 200Mi
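As a sketch of how to apply this: because OLM reconciles the operator Deployment back to what the CSV declares, the limits usually need to be raised in the CSV rather than only on the Deployment (the CSV name below is a placeholder; look it up first):

# Find the installed CSV, then edit its deployment spec.
oc get csv -n openshift-operators | grep gpu-operator
oc edit csv <gpu-operator-csv-name> -n openshift-operators
# Raise resources under .spec.install.spec.deployments[].spec.template.spec.containers[].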

We hit memory issues on OCP after upgrading the NVIDIA operator recently. We were running under 1 Gi previously, and since then the operator pod exceeds 2.5 Gi on startup. In the past, with other operators, this was usually the case when the operator was configured to list/watch objects at cluster scope: in large clusters with many objects, that means more data being returned to the operator. I don't know if that's the case for this operator, but I see it has clusterrolebindings. I did not dig into it further; we bumped the memory up again and it's working for now.
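As a rough way to gauge whether cluster-scoped list/watch volume lines up with the memory jump, one can compare object counts against the operator's live usage; this is only a sketch, and the grep pattern is an assumption about the pod name:

# Rough size of some cluster-scoped collections the operator may list or watch.
oc get nodes --no-headers | wc -l
oc get clusterrolebindings --no-headers | wc -l
# Live memory usage of the operator pod (requires cluster metrics to be available).
oc adm top pods -n openshift-operators | grep gpu-operator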

ctrought avatar Jul 20 '22 03:07 ctrought

Thanks @ctrought, I will work with Red Hat to understand this behavior on OCP. We are not seeing this with K8s. The operator does fetch all node labels at startup, but it should not momentarily consume that much memory.

shivamerla avatar Aug 08 '22 18:08 shivamerla