
Nvidia GPU operator failing to install on OpenShift with dedicated rather than shared nodes

Open smithbk opened this issue 2 years ago • 12 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [ ] Are you running Kubernetes v1.13+?
  • [ ] Are you running Docker (>= 18.06) or CRI-O (>= 1.13)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

I am trying to install the NVIDIA GPU Operator on OpenShift Container Platform 4.8.39, and the gpu-operator pod fails to schedule with "node(s) didn't match Pod's node affinity/selector". There are multiple GPU nodes, but they all carry the "enterprise.discover.com/dedicated: true" label rather than the "enterprise.discover.com/shared: true" label that the gpu-operator pod's nodeSelector requires.

I opened https://access.redhat.com/support/cases/*/case/03265998 for this issue with Red Hat, but they said to reach out to NVIDIA to determine how to edit the CRD or DaemonSet to add the correct nodeSelector.

Can you tell me how to configure it to use the nodeSelector "enterprise.discover.com/dedicated: true" rather than "enterprise.discover.com/shared: true", or how else to make this work?

Thank you, Keith

2. Steps to reproduce the issue

Installed the Node Feature Discovery operator and then tried to install the NVIDIA GPU operator.

3. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods --all-namespaces

  • [ ] kubernetes daemonset status: kubectl get ds --all-namespaces

  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo

  • [ ] Docker configuration file: cat /etc/docker/daemon.json

  • [ ] Docker runtime configuration: docker info | grep runtime

  • [ ] NVIDIA shared directory: ls -la /run/nvidia

  • [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • [ ] NVIDIA driver directory: ls -la /run/nvidia/driver

  • [ ] kubelet logs journalctl -u kubelet > kubelet.logs

smithbk · Jul 15 '22

@smithbk By default, the gpu-operator pod deployed through OLM doesn't have any specific nodeSelector or tolerations. Did you add the nodeSelector by editing the CSV? Can you share the podSpec of the operator Deployment as well as the node taints you have set up?
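For reference, commands along these lines should capture both pieces of information (this is a sketch: it assumes the default nvidia-gpu-operator namespace and selects GPU nodes by the NFD NVIDIA PCI vendor label feature.node.kubernetes.io/pci-10de.present=true; adjust if your install differs):

  # Operator Deployment spec, including any nodeSelector/tolerations it carries
  oc get deployment gpu-operator -n nvidia-gpu-operator -o yaml

  # Names and taints of the GPU nodes
  oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true \
    -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints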

shivamerla · Jul 15 '22

@shivamerla No, I didn't edit anything, and I'm not exactly sure where I would edit the CSV. The entire pod YAML is below, but the gpu-operator Deployment YAML does not contain a nodeSelector, so I am not sure what is inserting the nodeSelector into the pod spec. Any ideas? What else can I provide?

Thanks

kind: Pod
apiVersion: v1
metadata:
  generateName: gpu-operator-78b97d8f98-
  annotations:
    openshift.io/scc: hostmount-anyuid
    operators.operatorframework.io/builder: operator-sdk-v1.4.0
    operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
    certified: 'true'
    olm.targetNamespaces: nvidia-gpu-operator
    operatorframework.io/properties: >-
      {"properties":[{"type":"olm.gvk","value":{"group":"nvidia.com","kind":"ClusterPolicy","version":"v1"}},{"type":"olm.package","value":{"packageName":"gpu-operator-certified","version":"1.9.1"}}]}
    repository: 'http://github.com/NVIDIA/gpu-operator'
    support: NVIDIA
    provider: NVIDIA
    operators.openshift.io/infrastructure-features: '["Disconnected"]'
    alm-examples: |-
      [
        {
          "apiVersion": "nvidia.com/v1",
          "kind": "ClusterPolicy",
          "metadata": {
            "name": "gpu-cluster-policy"
          },
          "spec": {
            "dcgmExporter": {
              "config": {
                "name": ""
              }
            },
            "dcgm": {
              "enabled": true
            },
            "daemonsets": {
            },
            "devicePlugin": {
            },
            "driver": {
              "enabled": true,
              "use_ocp_driver_toolkit": true,
              "repoConfig": {
                "configMapName": ""
              },
              "certConfig": {
                "name": ""
              },
              "licensingConfig": {
                "nlsEnabled": false,
                "configMapName": ""
              },
              "virtualTopology": {
                "config": ""
              }
            },
            "gfd": {
            },
            "migManager": {
              "enabled": true
            },
            "nodeStatusExporter": {
              "enabled": true
            },
            "operator": {
              "defaultRuntime": "crio",
              "deployGFD": true,
              "initContainer": {
              }
            },
            "mig": {
              "strategy": "single"
            },
            "toolkit": {
              "enabled": true
            },
            "validator": {
              "plugin": {
                "env": [
                  {
                    "name": "WITH_WORKLOAD",
                    "value": "true"
                  }
                ]
              }
            }
          }
        }
      ]
    capabilities: Basic Install
    olm.operatorNamespace: nvidia-gpu-operator
    containerImage: >-
      nvcr.io/nvidia/gpu-operator@sha256:173639a16409d7aeba2f5ca7ccd6260ea621d280a24b8933089d6df0623bb657
    createdAt: 'Thu Jan 20 07:14:13 PST 2022'
    categories: 'AI/Machine Learning, OpenShift Optional'
    operatorframework.io/suggested-namespace: nvidia-gpu-operator
    description: Automate the management and monitoring of NVIDIA GPUs.
    olm.operatorGroup: nvidia-gpu-operator-fqxw8
  resourceVersion: '229373247'
  name: gpu-operator-78b97d8f98-ps87s
  uid: 8f2ac191-2659-47ee-96cd-3d77c922bfbe
  creationTimestamp: '2022-07-13T15:37:10Z'
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2022-07-13T15:37:10Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            'f:olm.operatorNamespace': {}
            'f:provider': {}
            'f:operators.openshift.io/infrastructure-features': {}
            'f:createdAt': {}
            'f:alm-examples': {}
            'f:description': {}
            'f:olm.operatorGroup': {}
            'f:capabilities': {}
            .: {}
            'f:containerImage': {}
            'f:categories': {}
            'f:operatorframework.io/suggested-namespace': {}
            'f:operators.operatorframework.io/project_layout': {}
            'f:certified': {}
            'f:operatorframework.io/properties': {}
            'f:operators.operatorframework.io/builder': {}
            'f:support': {}
            'f:olm.targetNamespaces': {}
            'f:repository': {}
          'f:generateName': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:app.kubernetes.io/component': {}
            'f:pod-template-hash': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"60fd6652-827f-4cf4-b5e8-53610f4fd0a7"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:spec':
          'f:volumes':
            .: {}
            'k:{"name":"host-os-release"}':
              .: {}
              'f:hostPath':
                .: {}
                'f:path': {}
                'f:type': {}
              'f:name': {}
          'f:containers':
            'k:{"name":"gpu-operator"}':
              'f:image': {}
              'f:volumeMounts':
                .: {}
                'k:{"mountPath":"/host-etc/os-release"}':
                  .: {}
                  'f:mountPath': {}
                  'f:name': {}
                  'f:readOnly': {}
              'f:terminationMessagePolicy': {}
              .: {}
              'f:resources':
                .: {}
                'f:limits':
                  .: {}
                  'f:cpu': {}
                  'f:memory': {}
                'f:requests':
                  .: {}
                  'f:cpu': {}
                  'f:memory': {}
              'f:args': {}
              'f:command': {}
              'f:livenessProbe':
                .: {}
                'f:failureThreshold': {}
                'f:httpGet':
                  .: {}
                  'f:path': {}
                  'f:port': {}
                  'f:scheme': {}
                'f:initialDelaySeconds': {}
                'f:periodSeconds': {}
                'f:successThreshold': {}
                'f:timeoutSeconds': {}
              'f:env':
                'k:{"name":"VALIDATOR_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"OPERATOR_NAMESPACE"}':
                  .: {}
                  'f:name': {}
                  'f:valueFrom':
                    .: {}
                    'f:fieldRef':
                      .: {}
                      'f:apiVersion': {}
                      'f:fieldPath': {}
                'k:{"name":"DCGM_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"MIG_MANAGER_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"DRIVER_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"DCGM_EXPORTER_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"DEVICE_PLUGIN_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"HTTPS_PROXY"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                .: {}
                'k:{"name":"CONTAINER_TOOLKIT_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"CUDA_BASE_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"GFD_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"NO_PROXY"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"DRIVER_MANAGER_IMAGE"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"HTTP_PROXY"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
                'k:{"name":"OPERATOR_CONDITION_NAME"}':
                  .: {}
                  'f:name': {}
                  'f:value': {}
              'f:readinessProbe':
                .: {}
                'f:failureThreshold': {}
                'f:httpGet':
                  .: {}
                  'f:path': {}
                  'f:port': {}
                  'f:scheme': {}
                'f:initialDelaySeconds': {}
                'f:periodSeconds': {}
                'f:successThreshold': {}
                'f:timeoutSeconds': {}
              'f:securityContext':
                .: {}
                'f:allowPrivilegeEscalation': {}
              'f:terminationMessagePath': {}
              'f:imagePullPolicy': {}
              'f:ports':
                .: {}
                'k:{"containerPort":8080,"protocol":"TCP"}':
                  .: {}
                  'f:containerPort': {}
                  'f:name': {}
                  'f:protocol': {}
              'f:name': {}
          'f:dnsPolicy': {}
          'f:priorityClassName': {}
          'f:serviceAccount': {}
          'f:restartPolicy': {}
          'f:schedulerName': {}
          'f:terminationGracePeriodSeconds': {}
          'f:serviceAccountName': {}
          'f:enableServiceLinks': {}
          'f:securityContext': {}
    - manager: kube-scheduler
      operation: Update
      apiVersion: v1
      time: '2022-07-13T15:37:10Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions':
            .: {}
            'k:{"type":"PodScheduled"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
  namespace: nvidia-gpu-operator
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: gpu-operator-78b97d8f98
      uid: 60fd6652-827f-4cf4-b5e8-53610f4fd0a7
      controller: true
      blockOwnerDeletion: true
  labels:
    app: gpu-operator
    app.kubernetes.io/component: gpu-operator
    pod-template-hash: 78b97d8f98
spec:
  nodeSelector:
    enterprise.discover.com/shared: 'true'
  restartPolicy: Always
  serviceAccountName: gpu-operator
  imagePullSecrets:
    - name: gpu-operator-dockercfg-7zzpq
  priority: 2000001000
  schedulerName: default-scheduler
  enableServiceLinks: true
  terminationGracePeriodSeconds: 10
  preemptionPolicy: PreemptLowerPriority
  securityContext:
    seLinuxOptions:
      level: 's0:c31,c5'
  containers:
    - resources:
        limits:
          cpu: 500m
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 200Mi
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8081
          scheme: HTTP
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      name: gpu-operator
      command:
        - gpu-operator
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8081
          scheme: HTTP
        initialDelaySeconds: 15
        timeoutSeconds: 1
        periodSeconds: 20
        successThreshold: 1
        failureThreshold: 3
      env:
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: VALIDATOR_IMAGE
          value: >-
            nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:d7e0397249cd5099046506f32841535ea4f329f7b7583a6ddd9f75ff0f53385e
        - name: GFD_IMAGE
          value: >-
            nvcr.io/nvidia/gpu-feature-discovery@sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f
        - name: CONTAINER_TOOLKIT_IMAGE
          value: >-
            nvcr.io/nvidia/k8s/container-toolkit@sha256:5f826e306d093332a86afa7d1e96218b5bdda8d33067931cff7914f6bb2994ee
        - name: DCGM_IMAGE
          value: >-
            nvcr.io/nvidia/cloud-native/dcgm@sha256:f4c4de8d66b2fef8cebaee6fec2fb2d15d01e835de2654df6dfd4a0ce0baec6b
        - name: DCGM_EXPORTER_IMAGE
          value: >-
            nvcr.io/nvidia/k8s/dcgm-exporter@sha256:8546a3e1ca8e642d2dbbfde13ee439e8137513be9886f9fe51f4fa9c4db80198
        - name: DEVICE_PLUGIN_IMAGE
          value: >-
            nvcr.io/nvidia/k8s-device-plugin@sha256:69171f906efe4bbabe31688343e59feea08a7e0ef8b0d9efb466abfa153aec16
        - name: DRIVER_IMAGE
          value: >-
            nvcr.io/nvidia/driver@sha256:c9af394fe78c02acbc2fae2c43fd73030ef54ab76156b1ae566eaed53f9b4835
        - name: DRIVER_MANAGER_IMAGE
          value: >-
            nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:54233ebccbc3d2b388b237031907d58c3719d0e6f3ecb874349c91e8145225d2
        - name: MIG_MANAGER_IMAGE
          value: >-
            nvcr.io/nvidia/cloud-native/k8s-mig-manager@sha256:d718dda2f9c9f0a465240772ed1ca6db44789d37255172e00637a092bdd1ba31
        - name: CUDA_BASE_IMAGE
          value: >-
            nvcr.io/nvidia/cuda@sha256:e137c897256501537e0986963889a91ec90cac029b5263fc4b229b278f5b1a02
        - name: HTTP_PROXY
          value: 'http://proxy-app.discoverfinancial.com:8080'
        - name: HTTPS_PROXY
          value: 'http://proxy-app.discoverfinancial.com:8080'
        - name: NO_PROXY
          value: >-
            .artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.aws.discoverfinancial.com,.cluster.local,.discoverfinancial.com,.ec2.internal,.na.discoverfinancial.com,.ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.ocp.aws.discoverfinancial.com,.ocpdev.us-east-1.ac.discoverfinancial.com,.prdops3-app.ocp.aws.discoverfinancial.com,.rw.discoverfinancial.com,.svc,10.0.0.0/8,10.111.0.0/16,127.0.0.1,169.254.169.254,172.23.0.0/16,172.24.0.0/14,api-int.aws-useast1-apps-lab-21.ocpdev.us-east-1.ac.discoverfinancial.com,artifactory.prdops3-app.ocp.aws.discoverfinancial.com,aws.discoverfinancial.com,discoverfinancial.com,ec2.internal,localhost,na.discoverfinancial.com,ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,ocp.aws.discoverfinancial.com,ocpdev.us-east-1.ac.discoverfinancial.com,prdops3-app.ocp.aws.discoverfinancial.com,rw.discoverfinancial.com
        - name: OPERATOR_CONDITION_NAME
          value: gpu-operator-certified.v1.9.1
      securityContext:
        capabilities:
          drop:
            - MKNOD
        allowPrivilegeEscalation: false
      ports:
        - name: metrics
          containerPort: 8080
          protocol: TCP
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - name: host-os-release
          readOnly: true
          mountPath: /host-etc/os-release
        - name: kube-api-access-t6rwj
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePolicy: File
      image: >-
        nvcr.io/nvidia/gpu-operator@sha256:173639a16409d7aeba2f5ca7ccd6260ea621d280a24b8933089d6df0623bb657
      args:
        - '--leader-elect'
  serviceAccount: gpu-operator
  volumes:
    - name: host-os-release
      hostPath:
        path: /etc/os-release
        type: ''
    - name: kube-api-access-t6rwj
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
          - configMap:
              name: openshift-service-ca.crt
              items:
                - key: service-ca.crt
                  path: service-ca.crt
        defaultMode: 420
  dnsPolicy: ClusterFirst
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/memory-pressure
      operator: Exists
      effect: NoSchedule
  priorityClassName: system-node-critical
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2022-07-13T15:37:10Z'
      reason: Unschedulable
      message: >-
        0/16 nodes are available: 13 node(s) didn't match Pod's node
        affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/master:
        }, that the pod didn't tolerate.
  qosClass: Burstable

smithbk · Jul 18 '22

@smithbk Can you get the taints on the GPU nodes? We probably need to add those tolerations to the GPU Operator pod. This is done by editing the CSV: run oc get csv -n nvidia-gpu-operator and edit the gpu-operator-certified CSV. We can also add the following toleration to let it run on master nodes (a sketch of where this goes in the CSV follows the snippet below):

  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Equal"
    value: ""
    effect: "NoSchedule"
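For orientation, here is a rough, heavily trimmed sketch of where that toleration would sit inside the ClusterServiceVersion, following the standard OLM layout (the deployment name gpu-operator is inferred from the ReplicaSet name above, so verify the exact path against your own CSV):

  # gpu-operator-certified ClusterServiceVersion (trimmed)
  # OLM reconciles the operator Deployment from this spec, so the toleration
  # needs to live here rather than in the Deployment itself.
  spec:
    install:
      spec:
        deployments:
          - name: gpu-operator
            spec:
              template:
                spec:
                  tolerations:
                    - key: "node-role.kubernetes.io/master"
                      operator: "Equal"
                      value: ""
                      effect: "NoSchedule"

Edits made directly to the Deployment tend to be reverted by OLM on its next sync, which is why the CSV is the place to make this change persist.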

shivamerla · Jul 18 '22

@shivamerla There are no taints on the GPU nodes. For example, the following is the describe output for one of them:

Name:               ip-10-111-47-77.ec2.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=p3.2xlarge
                    beta.kubernetes.io/os=linux
                    contact=DNA-OCP
                    enterprise.discover.com/cluster-id=aws-useast1-apps-lab-7jw2f
                    enterprise.discover.com/cluster-name=aws-useast1-apps-lab-21
                    enterprise.discover.com/cost_center=458002
                    enterprise.discover.com/data-classification=na
                    enterprise.discover.com/dedicated=true
                    enterprise.discover.com/environment=lab
                    enterprise.discover.com/freedom=false
                    enterprise.discover.com/gdpr=false
                    enterprise.discover.com/openshift=true
                    enterprise.discover.com/openshift-role=worker
                    enterprise.discover.com/pci=false
                    enterprise.discover.com/product=datalake
                    enterprise.discover.com/public=false
                    enterprise.discover.com/support-assignment-group=DNA-OCP
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1c
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.HLE=true
                    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
                    feature.node.kubernetes.io/cpu-cpuid.RTM=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE42=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-pstate.scaling_governor=performance
                    feature.node.kubernetes.io/cpu-pstate.status=active
                    feature.node.kubernetes.io/cpu-pstate.turbo=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true
                    feature.node.kubernetes.io/kernel-selinux.enabled=true
                    feature.node.kubernetes.io/kernel-version.full=4.18.0-305.45.1.el8_4.x86_64
                    feature.node.kubernetes.io/kernel-version.major=4
                    feature.node.kubernetes.io/kernel-version.minor=18
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-1013.present=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=rhcos
                    feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.8
                    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=48.84.202204202010-0
                    feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=4.8
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=8
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-111-47-77
                    kubernetes.io/os=linux
                    machine.openshift.io/cluster-api-cluster=aws-useast1-apps-lab-21
                    machine.openshift.io/cluster-api-cluster-name=aws-useast1-apps-lab-21
                    machine.openshift.io/cluster-api-machine-role=worker
                    machine.openshift.io/cluster-api-machineset=mrc-codes-1c
                    machine.openshift.io/cluster-api-machineset-group=mrc-codes
                    machine.openshift.io/cluster-api-machineset-ha=1c
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=p3.2xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-1c
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1c
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0202e413d71c10150"}
                    machine.openshift.io/machine: openshift-machine-api/mrc-codes-1c-hpnm7
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-b84cfcfa061050187c89440b81894cef
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-b84cfcfa061050187c89440b81894cef
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.HLE,cpu-cpuid.HYPERVISOR,cpu-cpuid.RTM,cpu-cpuid.SSE4,...
                    nfd.node.kubernetes.io/worker.version: 1.16
                    projectcalico.org/IPv4Address: 10.111.47.77/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 172.25.146.128
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 01 Jul 2022 14:52:10 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-111-47-77.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Tue, 19 Jul 2022 12:21:53 -0400
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 01 Jul 2022 14:53:13 -0400   Fri, 01 Jul 2022 14:53:13 -0400   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 19 Jul 2022 12:19:49 -0400   Fri, 01 Jul 2022 14:52:10 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 19 Jul 2022 12:19:49 -0400   Fri, 01 Jul 2022 14:52:10 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 19 Jul 2022 12:19:49 -0400   Fri, 01 Jul 2022 14:52:10 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 19 Jul 2022 12:19:49 -0400   Fri, 01 Jul 2022 14:53:10 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.111.47.77
  Hostname:     ip-10-111-47-77.ec2.internal
  InternalDNS:  ip-10-111-47-77.ec2.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         8
  ephemeral-storage:           125277164Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      62855724Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         7910m
  ephemeral-storage:           115455434152
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      58697421127680m
  pods:                        250
System Info:
  Machine ID:                             a839b8350a194b8b807ce2d9634313b2
  System UUID:                            ec28b13d-d2c8-64b5-0a39-b020a348b232
  Boot ID:                                e86e5f34-ff9c-4eb5-ac6c-367349fe2613
  Kernel Version:                         4.18.0-305.45.1.el8_4.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 48.84.202204202010-0 (Ootpa)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.21.6-3.rhaos4.8.git19780ee.2.el8
  Kubelet Version:                        v1.21.8+ed4d8fd
  Kube-Proxy Version:                     v1.21.8+ed4d8fd
ProviderID:                               aws:///us-east-1c/i-0202e413d71c10150
Non-terminated Pods:                      (24 in total)
  Namespace                               Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                   ------------  ----------  ---------------  -------------  ---
  calico-system                           calico-node-n4rg6                      1 (12%)       0 (0%)      1536Mi (2%)      0 (0%)         17d
  centralized-logging-solution            cls-filebeat-vg4bk                     200m (2%)     0 (0%)      100Mi (0%)       400Mi (0%)     17d
  instana-agent                           instana-agent-wcrpd                    600m (7%)     2 (25%)     2112Mi (3%)      5Gi (9%)       17d
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-j8srl          30m (0%)      0 (0%)      150Mi (0%)       0 (0%)         17d
  openshift-cluster-node-tuning-operator  tuned-sqr8x                            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         17d
  openshift-dns                           dns-default-bz2tn                      60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         17d
  openshift-dns                           node-resolver-ch4p8                    5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         17d
  openshift-image-registry                node-ca-qbg5z                          10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         17d
  openshift-ingress-canary                ingress-canary-nxzhm                   10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         17d
  openshift-kube-proxy                    openshift-kube-proxy-6hrcg             110m (1%)     0 (0%)      220Mi (0%)       0 (0%)         17d
  openshift-machine-config-operator       machine-config-daemon-72v92            40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         17d
  openshift-marketplace                   community-operators-qf8pl              10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         11s
  openshift-marketplace                   community-operators-rcvrr              10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         103m
  openshift-marketplace                   redhat-marketplace-2hmqz               10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         31h
  openshift-marketplace                   redhat-marketplace-ck9cx               10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         11s
  openshift-monitoring                    node-exporter-zcgb7                    9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         17d
  openshift-multus                        multus-additional-cni-plugins-7vftw    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         17d
  openshift-multus                        multus-pjnkv                           10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         17d
  openshift-multus                        network-metrics-daemon-mg8nm           20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         17d
  openshift-network-diagnostics           network-check-target-qh5rp             10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         17d
  openshift-nfd                           nfd-worker-f68pj                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         17d
  splunk-agent                            splunk-agent-cjr78                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         17d
  sysdig-agent                            sysdig-agent-hh922                     1 (12%)       2 (25%)     512Mi (0%)       1536Mi (2%)    17d
  sysdig-agent                            sysdig-image-analyzer-tgfbp            250m (3%)     500m (6%)   512Mi (0%)       1536Mi (2%)    17d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         3424m (43%)   4500m (56%)
  memory                      5910Mi (10%)  8592Mi (15%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>

smithbk · Jul 19 '22

@smithbk Can you describe the project (oc describe project nvidia-gpu-operator) to see if an annotation was added that injects the nodeSelector? https://docs.openshift.com/container-platform/4.10/nodes/scheduling/nodes-scheduler-taints-tolerations.html#nodes-scheduler-taints-tolerations-projects_nodes-scheduler-taints-tolerations
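If a project-level selector is the cause, it shows up as an openshift.io/node-selector annotation on the namespace. A quick way to read it directly (annotation key per the OpenShift docs linked above):

  oc get namespace nvidia-gpu-operator \
    -o jsonpath='{.metadata.annotations.openshift\.io/node-selector}{"\n"}'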

shivamerla · Jul 19 '22

@shivamerla I was able to get placement to work by running oc edit namespace nvidia-gpu-operator and changing

    openshift.io/node-selector: enterprise.discover.com/shared=true

to

    openshift.io/node-selector: enterprise.discover.com/dedicated=true

in the metadata.annotations section. However, now the readiness and liveness probes are failing as shown below

[screenshot: readiness and liveness probe failure events]

I'm speculating, but is the pod trying to connect to the wrong API endpoint, perhaps because the endpoint is different for shared versus dedicated nodes? If so, is there a way to configure the appropriate API endpoint for dedicated nodes?

Otherwise, any idea why it is failing? Everything else seems to be working fine in this cluster.

smithbk · Jul 20 '22

@smithbk That should be the pod IP, which the kubelet uses for the readiness/liveness probes. What do the gpu-operator logs show? Is the ClusterPolicy status ready?
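Commands along these lines should surface both (the ClusterPolicy name gpu-cluster-policy is assumed from the default alm-examples shown earlier, and state is the readiness field the operator reports in the ClusterPolicy status):

  # Full operator log
  oc logs -n nvidia-gpu-operator deployment/gpu-operator --tail=-1

  # ClusterPolicy readiness state
  oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'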

shivamerla · Jul 20 '22

@shivamerla The ClusterPolicy status is not ready.

[screenshot: ClusterPolicy status showing not ready]

Logs of the gpu-operator pod show the following:

1.6583371149379141e+09	ERROR	Failed to get API Group-Resources	{"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/cluster.New
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160
sigs.k8s.io/controller-runtime/pkg/manager.New
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:313
main.main
	/workspace/main.go:68
runtime.main
	/usr/local/go/src/runtime/proc.go:255
1.6583371149380147e+09	ERROR	setup	unable to start manager	{"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o timeout"}
main.main
	/workspace/main.go:77
runtime.main
	/usr/local/go/src/runtime/proc.go:255

smithbk · Jul 20 '22

@shivamerla Is it possible that the gpu-operator is trying to access the shared pod IP instead of the dedicated pod IP? Just guessing.

smithbk · Jul 21 '22

@smithbk Can you attach the complete gpu-operator pod logs? That address should be the kubernetes service IP, and it should be reachable from every node: oc get service kubernetes -n default.
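One way to check whether a dedicated GPU node can reach that service IP at all is to run a throwaway pod in the nvidia-gpu-operator namespace, which now carries the dedicated node selector, so it lands on the same kind of node as the operator. This is only a sketch (the image choice is an assumption; any image with curl works), and any HTTP response, even a 403, would prove TCP connectivity, whereas the operator's error was an i/o timeout:

  oc run api-check -n nvidia-gpu-operator --rm -it --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi --command \
    -- curl -k -m 10 https://172.23.0.1:443/version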

shivamerla · Jul 21 '22

@shivamerla What I pasted above was the complete gpu-operator log, but here it is again:

1.658493246896439e+09	ERROR	Failed to get API Group-Resources	{"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/cluster.New
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160
sigs.k8s.io/controller-runtime/pkg/manager.New
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:313
main.main
	/workspace/main.go:68
runtime.main
	/usr/local/go/src/runtime/proc.go:255
1.6584932468965394e+09	ERROR	setup	unable to start manager	{"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o timeout"}
main.main
	/workspace/main.go:77
runtime.main
	/usr/local/go/src/runtime/proc.go:255

smithbk · Jul 22 '22

@smithbk We would need to understand more about how the cluster is set up. This error should not be isolated to the GPU Operator; it would affect any pod on those nodes that tries to reach the API server. Can you work with Red Hat to understand this better?

shivamerla · Jul 22 '22