Nvidia GPU operator failing to install on OpenShift with dedicated rather than shared nodes
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [ ] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
I am trying to install the NVIDIA GPU Operator on OpenShift Container Platform 4.8.39, and the gpu-operator pod is failing to schedule because "node(s) didn't match Pod's node affinity/selector". There are multiple GPU nodes, but they all carry the "enterprise.discover.com/dedicated: true" label rather than the "enterprise.discover.com/shared: true" label that the gpu-operator pod's nodeSelector requires.
I opened https://access.redhat.com/support/cases/*/case/03265998 for this issue against Red Hat, but they said to reach out to NVIDIA to determine how to edit the CRD or daemonset to add the correct nodeSelector.
Can you tell me how to configure it to use a nodeSelector for "enterprise.discover.com/dedicated: true" rather than "enterprise.discover.com/shared: true", or how else to make this work? (A quick check of the label mismatch is sketched below.)
Thank you, Keith
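For reference, the mismatch can be confirmed by listing the nodes and the pending pod's selector; the commands below are only a sketch and reuse the label keys, namespace, and app label quoted elsewhere in this thread:
# GPU nodes as they are actually labelled
oc get nodes -l enterprise.discover.com/dedicated=true
# nodes the operator pod is currently allowed to land on
oc get nodes -l enterprise.discover.com/shared=true
# the selector the pending gpu-operator pod is carrying
oc get pod -n nvidia-gpu-operator -l app=gpu-operator -o jsonpath='{.items[0].spec.nodeSelector}'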
2. Steps to reproduce the issue
Installed the Node Feature Discovery operator and then tried to install the NVIDIA GPU operator.
3. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: kubectl get pods --all-namespaces
- [ ] kubernetes daemonset status: kubectl get ds --all-namespaces
- [ ] If a pod/ds is in an error or pending state: kubectl describe pod -n NAMESPACE POD_NAME
- [ ] If a pod/ds is in an error or pending state: kubectl logs -n NAMESPACE POD_NAME
- [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo
- [ ] Docker configuration file: cat /etc/docker/daemon.json
- [ ] Docker runtime configuration: docker info | grep runtime
- [ ] NVIDIA shared directory: ls -la /run/nvidia
- [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
- [ ] NVIDIA driver directory: ls -la /run/nvidia/driver
- [ ] kubelet logs: journalctl -u kubelet > kubelet.logs
@smithbk By default, the gpu-operator pod deployed through OLM doesn't have any specific nodeSelector/tolerations. Did you add the nodeSelector by editing the CSV? Can you share the podSpec of the operator Deployment as well as the node taints you have set up?
@shivamerla No, I didn't edit anything, and I'm not exactly sure where I would edit the CSV. The entire pod YAML is below, but the gpu-operator Deployment YAML does not contain a nodeSelector, so I am not sure what is injecting the nodeSelector into the pod spec (one way to check for an injected selector is sketched after the manifest). Any ideas? What else can I provide?
Thanks
kind: Pod
apiVersion: v1
metadata:
generateName: gpu-operator-78b97d8f98-
annotations:
openshift.io/scc: hostmount-anyuid
operators.operatorframework.io/builder: operator-sdk-v1.4.0
operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
certified: 'true'
olm.targetNamespaces: nvidia-gpu-operator
operatorframework.io/properties: >-
{"properties":[{"type":"olm.gvk","value":{"group":"nvidia.com","kind":"ClusterPolicy","version":"v1"}},{"type":"olm.package","value":{"packageName":"gpu-operator-certified","version":"1.9.1"}}]}
repository: 'http://github.com/NVIDIA/gpu-operator'
support: NVIDIA
provider: NVIDIA
operators.openshift.io/infrastructure-features: '["Disconnected"]'
alm-examples: |-
[
{
"apiVersion": "nvidia.com/v1",
"kind": "ClusterPolicy",
"metadata": {
"name": "gpu-cluster-policy"
},
"spec": {
"dcgmExporter": {
"config": {
"name": ""
}
},
"dcgm": {
"enabled": true
},
"daemonsets": {
},
"devicePlugin": {
},
"driver": {
"enabled": true,
"use_ocp_driver_toolkit": true,
"repoConfig": {
"configMapName": ""
},
"certConfig": {
"name": ""
},
"licensingConfig": {
"nlsEnabled": false,
"configMapName": ""
},
"virtualTopology": {
"config": ""
}
},
"gfd": {
},
"migManager": {
"enabled": true
},
"nodeStatusExporter": {
"enabled": true
},
"operator": {
"defaultRuntime": "crio",
"deployGFD": true,
"initContainer": {
}
},
"mig": {
"strategy": "single"
},
"toolkit": {
"enabled": true
},
"validator": {
"plugin": {
"env": [
{
"name": "WITH_WORKLOAD",
"value": "true"
}
]
}
}
}
}
]
capabilities: Basic Install
olm.operatorNamespace: nvidia-gpu-operator
containerImage: >-
nvcr.io/nvidia/gpu-operator@sha256:173639a16409d7aeba2f5ca7ccd6260ea621d280a24b8933089d6df0623bb657
createdAt: 'Thu Jan 20 07:14:13 PST 2022'
categories: 'AI/Machine Learning, OpenShift Optional'
operatorframework.io/suggested-namespace: nvidia-gpu-operator
description: Automate the management and monitoring of NVIDIA GPUs.
olm.operatorGroup: nvidia-gpu-operator-fqxw8
resourceVersion: '229373247'
name: gpu-operator-78b97d8f98-ps87s
uid: 8f2ac191-2659-47ee-96cd-3d77c922bfbe
creationTimestamp: '2022-07-13T15:37:10Z'
managedFields:
- manager: kube-controller-manager
operation: Update
apiVersion: v1
time: '2022-07-13T15:37:10Z'
fieldsType: FieldsV1
fieldsV1:
'f:metadata':
'f:annotations':
'f:olm.operatorNamespace': {}
'f:provider': {}
'f:operators.openshift.io/infrastructure-features': {}
'f:createdAt': {}
'f:alm-examples': {}
'f:description': {}
'f:olm.operatorGroup': {}
'f:capabilities': {}
.: {}
'f:containerImage': {}
'f:categories': {}
'f:operatorframework.io/suggested-namespace': {}
'f:operators.operatorframework.io/project_layout': {}
'f:certified': {}
'f:operatorframework.io/properties': {}
'f:operators.operatorframework.io/builder': {}
'f:support': {}
'f:olm.targetNamespaces': {}
'f:repository': {}
'f:generateName': {}
'f:labels':
.: {}
'f:app': {}
'f:app.kubernetes.io/component': {}
'f:pod-template-hash': {}
'f:ownerReferences':
.: {}
'k:{"uid":"60fd6652-827f-4cf4-b5e8-53610f4fd0a7"}':
.: {}
'f:apiVersion': {}
'f:blockOwnerDeletion': {}
'f:controller': {}
'f:kind': {}
'f:name': {}
'f:uid': {}
'f:spec':
'f:volumes':
.: {}
'k:{"name":"host-os-release"}':
.: {}
'f:hostPath':
.: {}
'f:path': {}
'f:type': {}
'f:name': {}
'f:containers':
'k:{"name":"gpu-operator"}':
'f:image': {}
'f:volumeMounts':
.: {}
'k:{"mountPath":"/host-etc/os-release"}':
.: {}
'f:mountPath': {}
'f:name': {}
'f:readOnly': {}
'f:terminationMessagePolicy': {}
.: {}
'f:resources':
.: {}
'f:limits':
.: {}
'f:cpu': {}
'f:memory': {}
'f:requests':
.: {}
'f:cpu': {}
'f:memory': {}
'f:args': {}
'f:command': {}
'f:livenessProbe':
.: {}
'f:failureThreshold': {}
'f:httpGet':
.: {}
'f:path': {}
'f:port': {}
'f:scheme': {}
'f:initialDelaySeconds': {}
'f:periodSeconds': {}
'f:successThreshold': {}
'f:timeoutSeconds': {}
'f:env':
'k:{"name":"VALIDATOR_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"OPERATOR_NAMESPACE"}':
.: {}
'f:name': {}
'f:valueFrom':
.: {}
'f:fieldRef':
.: {}
'f:apiVersion': {}
'f:fieldPath': {}
'k:{"name":"DCGM_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"MIG_MANAGER_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"DRIVER_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"DCGM_EXPORTER_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"DEVICE_PLUGIN_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"HTTPS_PROXY"}':
.: {}
'f:name': {}
'f:value': {}
.: {}
'k:{"name":"CONTAINER_TOOLKIT_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"CUDA_BASE_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"GFD_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"NO_PROXY"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"DRIVER_MANAGER_IMAGE"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"HTTP_PROXY"}':
.: {}
'f:name': {}
'f:value': {}
'k:{"name":"OPERATOR_CONDITION_NAME"}':
.: {}
'f:name': {}
'f:value': {}
'f:readinessProbe':
.: {}
'f:failureThreshold': {}
'f:httpGet':
.: {}
'f:path': {}
'f:port': {}
'f:scheme': {}
'f:initialDelaySeconds': {}
'f:periodSeconds': {}
'f:successThreshold': {}
'f:timeoutSeconds': {}
'f:securityContext':
.: {}
'f:allowPrivilegeEscalation': {}
'f:terminationMessagePath': {}
'f:imagePullPolicy': {}
'f:ports':
.: {}
'k:{"containerPort":8080,"protocol":"TCP"}':
.: {}
'f:containerPort': {}
'f:name': {}
'f:protocol': {}
'f:name': {}
'f:dnsPolicy': {}
'f:priorityClassName': {}
'f:serviceAccount': {}
'f:restartPolicy': {}
'f:schedulerName': {}
'f:terminationGracePeriodSeconds': {}
'f:serviceAccountName': {}
'f:enableServiceLinks': {}
'f:securityContext': {}
- manager: kube-scheduler
operation: Update
apiVersion: v1
time: '2022-07-13T15:37:10Z'
fieldsType: FieldsV1
fieldsV1:
'f:status':
'f:conditions':
.: {}
'k:{"type":"PodScheduled"}':
.: {}
'f:lastProbeTime': {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
namespace: nvidia-gpu-operator
ownerReferences:
- apiVersion: apps/v1
kind: ReplicaSet
name: gpu-operator-78b97d8f98
uid: 60fd6652-827f-4cf4-b5e8-53610f4fd0a7
controller: true
blockOwnerDeletion: true
labels:
app: gpu-operator
app.kubernetes.io/component: gpu-operator
pod-template-hash: 78b97d8f98
spec:
nodeSelector:
enterprise.discover.com/shared: 'true'
restartPolicy: Always
serviceAccountName: gpu-operator
imagePullSecrets:
- name: gpu-operator-dockercfg-7zzpq
priority: 2000001000
schedulerName: default-scheduler
enableServiceLinks: true
terminationGracePeriodSeconds: 10
preemptionPolicy: PreemptLowerPriority
securityContext:
seLinuxOptions:
level: 's0:c31,c5'
containers:
- resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 200m
memory: 200Mi
readinessProbe:
httpGet:
path: /readyz
port: 8081
scheme: HTTP
initialDelaySeconds: 5
timeoutSeconds: 1
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
terminationMessagePath: /dev/termination-log
name: gpu-operator
command:
- gpu-operator
livenessProbe:
httpGet:
path: /healthz
port: 8081
scheme: HTTP
initialDelaySeconds: 15
timeoutSeconds: 1
periodSeconds: 20
successThreshold: 1
failureThreshold: 3
env:
- name: OPERATOR_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: VALIDATOR_IMAGE
value: >-
nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:d7e0397249cd5099046506f32841535ea4f329f7b7583a6ddd9f75ff0f53385e
- name: GFD_IMAGE
value: >-
nvcr.io/nvidia/gpu-feature-discovery@sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f
- name: CONTAINER_TOOLKIT_IMAGE
value: >-
nvcr.io/nvidia/k8s/container-toolkit@sha256:5f826e306d093332a86afa7d1e96218b5bdda8d33067931cff7914f6bb2994ee
- name: DCGM_IMAGE
value: >-
nvcr.io/nvidia/cloud-native/dcgm@sha256:f4c4de8d66b2fef8cebaee6fec2fb2d15d01e835de2654df6dfd4a0ce0baec6b
- name: DCGM_EXPORTER_IMAGE
value: >-
nvcr.io/nvidia/k8s/dcgm-exporter@sha256:8546a3e1ca8e642d2dbbfde13ee439e8137513be9886f9fe51f4fa9c4db80198
- name: DEVICE_PLUGIN_IMAGE
value: >-
nvcr.io/nvidia/k8s-device-plugin@sha256:69171f906efe4bbabe31688343e59feea08a7e0ef8b0d9efb466abfa153aec16
- name: DRIVER_IMAGE
value: >-
nvcr.io/nvidia/driver@sha256:c9af394fe78c02acbc2fae2c43fd73030ef54ab76156b1ae566eaed53f9b4835
- name: DRIVER_MANAGER_IMAGE
value: >-
nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:54233ebccbc3d2b388b237031907d58c3719d0e6f3ecb874349c91e8145225d2
- name: MIG_MANAGER_IMAGE
value: >-
nvcr.io/nvidia/cloud-native/k8s-mig-manager@sha256:d718dda2f9c9f0a465240772ed1ca6db44789d37255172e00637a092bdd1ba31
- name: CUDA_BASE_IMAGE
value: >-
nvcr.io/nvidia/cuda@sha256:e137c897256501537e0986963889a91ec90cac029b5263fc4b229b278f5b1a02
- name: HTTP_PROXY
value: 'http://proxy-app.discoverfinancial.com:8080'
- name: HTTPS_PROXY
value: 'http://proxy-app.discoverfinancial.com:8080'
- name: NO_PROXY
value: >-
.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.aws.discoverfinancial.com,.cluster.local,.discoverfinancial.com,.ec2.internal,.na.discoverfinancial.com,.ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,.ocp.aws.discoverfinancial.com,.ocpdev.us-east-1.ac.discoverfinancial.com,.prdops3-app.ocp.aws.discoverfinancial.com,.rw.discoverfinancial.com,.svc,10.0.0.0/8,10.111.0.0/16,127.0.0.1,169.254.169.254,172.23.0.0/16,172.24.0.0/14,api-int.aws-useast1-apps-lab-21.ocpdev.us-east-1.ac.discoverfinancial.com,artifactory.prdops3-app.ocp.aws.discoverfinancial.com,aws.discoverfinancial.com,discoverfinancial.com,ec2.internal,localhost,na.discoverfinancial.com,ocp-dev.artifactory.prdops3-app.ocp.aws.discoverfinancial.com,ocp.aws.discoverfinancial.com,ocpdev.us-east-1.ac.discoverfinancial.com,prdops3-app.ocp.aws.discoverfinancial.com,rw.discoverfinancial.com
- name: OPERATOR_CONDITION_NAME
value: gpu-operator-certified.v1.9.1
securityContext:
capabilities:
drop:
- MKNOD
allowPrivilegeEscalation: false
ports:
- name: metrics
containerPort: 8080
protocol: TCP
imagePullPolicy: IfNotPresent
volumeMounts:
- name: host-os-release
readOnly: true
mountPath: /host-etc/os-release
- name: kube-api-access-t6rwj
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
terminationMessagePolicy: File
image: >-
nvcr.io/nvidia/gpu-operator@sha256:173639a16409d7aeba2f5ca7ccd6260ea621d280a24b8933089d6df0623bb657
args:
- '--leader-elect'
serviceAccount: gpu-operator
volumes:
- name: host-os-release
hostPath:
path: /etc/os-release
type: ''
- name: kube-api-access-t6rwj
projected:
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
name: kube-root-ca.crt
items:
- key: ca.crt
path: ca.crt
- downwardAPI:
items:
- path: namespace
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- configMap:
name: openshift-service-ca.crt
items:
- key: service-ca.crt
path: service-ca.crt
defaultMode: 420
dnsPolicy: ClusterFirst
tolerations:
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
tolerationSeconds: 300
- key: node.kubernetes.io/memory-pressure
operator: Exists
effect: NoSchedule
priorityClassName: system-node-critical
status:
phase: Pending
conditions:
- type: PodScheduled
status: 'False'
lastProbeTime: null
lastTransitionTime: '2022-07-13T15:37:10Z'
reason: Unschedulable
message: >-
0/16 nodes are available: 13 node(s) didn't match Pod's node
affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/master:
}, that the pod didn't tolerate.
qosClass: Burstable
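In case it helps narrow down where the injected selector comes from, one thing to compare is the Deployment template against the project-level default node selector that OpenShift applies to every pod in the namespace. This is only a sketch; it assumes the Deployment is named gpu-operator (as the ReplicaSet name above suggests) and uses the namespace from this thread:
# nodeSelector rendered into the Deployment by OLM (expected to be empty)
oc get deployment gpu-operator -n nvidia-gpu-operator -o jsonpath='{.spec.template.spec.nodeSelector}'
# project-wide default node selector injected by OpenShift at admission time
oc get namespace nvidia-gpu-operator -o jsonpath='{.metadata.annotations.openshift\.io/node-selector}'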
@smithbk Can you get the taints on the GPU nodes? We probably need to add those tolerations to the GPU Operator pod. This is done by editing the CSV: run oc get csv -n <nvidia-gpu-operator> and edit the gpu-operator-certified CSV. We can also add the following toleration to let it run on master nodes:
tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Equal"
    value: ""
    effect: "NoSchedule"
@shivamerla There are no taints on the GPU nodes. For example, the following is the describe output for one of them:
Name: ip-10-111-47-77.ec2.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=p3.2xlarge
beta.kubernetes.io/os=linux
contact=DNA-OCP
enterprise.discover.com/cluster-id=aws-useast1-apps-lab-7jw2f
enterprise.discover.com/cluster-name=aws-useast1-apps-lab-21
enterprise.discover.com/cost_center=458002
enterprise.discover.com/data-classification=na
enterprise.discover.com/dedicated=true
enterprise.discover.com/environment=lab
enterprise.discover.com/freedom=false
enterprise.discover.com/gdpr=false
enterprise.discover.com/openshift=true
enterprise.discover.com/openshift-role=worker
enterprise.discover.com/pci=false
enterprise.discover.com/product=datalake
enterprise.discover.com/public=false
enterprise.discover.com/support-assignment-group=DNA-OCP
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1c
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.HLE=true
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
feature.node.kubernetes.io/cpu-cpuid.RTM=true
feature.node.kubernetes.io/cpu-cpuid.SSE4=true
feature.node.kubernetes.io/cpu-cpuid.SSE42=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpu-pstate.scaling_governor=performance
feature.node.kubernetes.io/cpu-pstate.status=active
feature.node.kubernetes.io/cpu-pstate.turbo=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true
feature.node.kubernetes.io/kernel-selinux.enabled=true
feature.node.kubernetes.io/kernel-version.full=4.18.0-305.45.1.el8_4.x86_64
feature.node.kubernetes.io/kernel-version.major=4
feature.node.kubernetes.io/kernel-version.minor=18
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-1013.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=rhcos
feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION=4.8
feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=48.84.202204202010-0
feature.node.kubernetes.io/system-os_release.RHEL_VERSION=8.4
feature.node.kubernetes.io/system-os_release.VERSION_ID=4.8
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=8
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-111-47-77
kubernetes.io/os=linux
machine.openshift.io/cluster-api-cluster=aws-useast1-apps-lab-21
machine.openshift.io/cluster-api-cluster-name=aws-useast1-apps-lab-21
machine.openshift.io/cluster-api-machine-role=worker
machine.openshift.io/cluster-api-machineset=mrc-codes-1c
machine.openshift.io/cluster-api-machineset-group=mrc-codes
machine.openshift.io/cluster-api-machineset-ha=1c
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=p3.2xlarge
node.openshift.io/os_id=rhcos
topology.ebs.csi.aws.com/zone=us-east-1c
topology.kubernetes.io/region=us-east-1
topology.kubernetes.io/zone=us-east-1c
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0202e413d71c10150"}
machine.openshift.io/machine: openshift-machine-api/mrc-codes-1c-hpnm7
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-b84cfcfa061050187c89440b81894cef
machineconfiguration.openshift.io/desiredConfig: rendered-worker-b84cfcfa061050187c89440b81894cef
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
nfd.node.kubernetes.io/extended-resources:
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.HLE,cpu-cpuid.HYPERVISOR,cpu-cpuid.RTM,cpu-cpuid.SSE4,...
nfd.node.kubernetes.io/worker.version: 1.16
projectcalico.org/IPv4Address: 10.111.47.77/20
projectcalico.org/IPv4IPIPTunnelAddr: 172.25.146.128
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 01 Jul 2022 14:52:10 -0400
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-10-111-47-77.ec2.internal
AcquireTime: <unset>
RenewTime: Tue, 19 Jul 2022 12:21:53 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 01 Jul 2022 14:53:13 -0400 Fri, 01 Jul 2022 14:53:13 -0400 CalicoIsUp Calico is running on this node
MemoryPressure False Tue, 19 Jul 2022 12:19:49 -0400 Fri, 01 Jul 2022 14:52:10 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 19 Jul 2022 12:19:49 -0400 Fri, 01 Jul 2022 14:52:10 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 19 Jul 2022 12:19:49 -0400 Fri, 01 Jul 2022 14:52:10 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 19 Jul 2022 12:19:49 -0400 Fri, 01 Jul 2022 14:53:10 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.111.47.77
Hostname: ip-10-111-47-77.ec2.internal
InternalDNS: ip-10-111-47-77.ec2.internal
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 8
ephemeral-storage: 125277164Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 62855724Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 7910m
ephemeral-storage: 115455434152
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 58697421127680m
pods: 250
System Info:
Machine ID: a839b8350a194b8b807ce2d9634313b2
System UUID: ec28b13d-d2c8-64b5-0a39-b020a348b232
Boot ID: e86e5f34-ff9c-4eb5-ac6c-367349fe2613
Kernel Version: 4.18.0-305.45.1.el8_4.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 48.84.202204202010-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.21.6-3.rhaos4.8.git19780ee.2.el8
Kubelet Version: v1.21.8+ed4d8fd
Kube-Proxy Version: v1.21.8+ed4d8fd
ProviderID: aws:///us-east-1c/i-0202e413d71c10150
Non-terminated Pods: (24 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
calico-system calico-node-n4rg6 1 (12%) 0 (0%) 1536Mi (2%) 0 (0%) 17d
centralized-logging-solution cls-filebeat-vg4bk 200m (2%) 0 (0%) 100Mi (0%) 400Mi (0%) 17d
instana-agent instana-agent-wcrpd 600m (7%) 2 (25%) 2112Mi (3%) 5Gi (9%) 17d
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-j8srl 30m (0%) 0 (0%) 150Mi (0%) 0 (0%) 17d
openshift-cluster-node-tuning-operator tuned-sqr8x 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 17d
openshift-dns dns-default-bz2tn 60m (0%) 0 (0%) 110Mi (0%) 0 (0%) 17d
openshift-dns node-resolver-ch4p8 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 17d
openshift-image-registry node-ca-qbg5z 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 17d
openshift-ingress-canary ingress-canary-nxzhm 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 17d
openshift-kube-proxy openshift-kube-proxy-6hrcg 110m (1%) 0 (0%) 220Mi (0%) 0 (0%) 17d
openshift-machine-config-operator machine-config-daemon-72v92 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 17d
openshift-marketplace community-operators-qf8pl 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 11s
openshift-marketplace community-operators-rcvrr 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 103m
openshift-marketplace redhat-marketplace-2hmqz 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 31h
openshift-marketplace redhat-marketplace-ck9cx 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 11s
openshift-monitoring node-exporter-zcgb7 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 17d
openshift-multus multus-additional-cni-plugins-7vftw 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 17d
openshift-multus multus-pjnkv 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 17d
openshift-multus network-metrics-daemon-mg8nm 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 17d
openshift-network-diagnostics network-check-target-qh5rp 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 17d
openshift-nfd nfd-worker-f68pj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17d
splunk-agent splunk-agent-cjr78 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17d
sysdig-agent sysdig-agent-hh922 1 (12%) 2 (25%) 512Mi (0%) 1536Mi (2%) 17d
sysdig-agent sysdig-image-analyzer-tgfbp 250m (3%) 500m (6%) 512Mi (0%) 1536Mi (2%) 17d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3424m (43%) 4500m (56%)
memory 5910Mi (10%) 8592Mi (15%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events: <none>
@smithbk Can you describe the project (oc describe project nvidia-gpu-operator) to see if an annotation was added that sets the nodeSelector? https://docs.openshift.com/container-platform/4.10/nodes/scheduling/nodes-scheduler-taints-tolerations.html#nodes-scheduler-taints-tolerations-projects_nodes-scheduler-taints-tolerations
@shivamerla I was able to get placement to work by running oc edit namespace nvidia-gpu-operator and changing
openshift.io/node-selector: enterprise.discover.com/shared=true
to
openshift.io/node-selector: enterprise.discover.com/dedicated=true
in the metadata.annotations section (an equivalent one-liner is sketched at the end of this comment). However, the readiness and liveness probes are now failing as shown below.
I'm speculating, but is the pod trying to connect to the wrong API endpoint, perhaps because the endpoint is different for shared versus dedicated nodes? If so, is there a way to configure the appropriate API endpoint for dedicated nodes?
Otherwise, any idea why it is failing? Everything else seems to be working fine in this cluster.
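The annotation change above amounts to a one-liner like the following; it is only a sketch, reusing the namespace and label key already shown in this thread:
# overwrite the project-wide default node selector on the operator namespace
oc annotate namespace nvidia-gpu-operator \
  openshift.io/node-selector=enterprise.discover.com/dedicated=true --overwrite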
@smithbk That should be the pod IP, which the kubelet uses to probe for readiness/liveness. What do the logs of the gpu-operator show? Is the ClusterPolicy status ready?
@shivamerla The ClusterPolicy status is not ready.
Logs of the gpu-operator pod show the following:
1.6583371149379141e+09 ERROR Failed to get API Group-Resources {"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/cluster.New
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160
sigs.k8s.io/controller-runtime/pkg/manager.New
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:313
main.main
/workspace/main.go:68
runtime.main
/usr/local/go/src/runtime/proc.go:255
1.6583371149380147e+09 ERROR setup unable to start manager {"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o timeout"}
main.main
/workspace/main.go:77
runtime.main
/usr/local/go/src/runtime/proc.go:255
@shivamerla Is it possible that the gpu-operator is trying to access the shared pod IP instead of the dedicated pod IP? Just guessing.
@smithbk Can you attach the complete gpu-operator pod logs? That address should be the kubernetes service IP and should be reachable; you can confirm it with oc get service kubernetes -n default.
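One way to check whether that service IP is reachable from a dedicated GPU node is a node debug session; the command below is only a sketch, using the node name and service IP already shown in this thread, and it assumes curl is available on the RHCOS host:
# probe the API server service IP from the GPU node itself; any HTTP status means the path works, a timeout reproduces the operator error
oc debug node/ip-10-111-47-77.ec2.internal -- chroot /host curl -k -s -o /dev/null -w '%{http_code}\n' https://172.23.0.1:443/version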
@shivamerla What I pasted above was the complete gpu-operator log, but here it is again:
1.658493246896439e+09 ERROR Failed to get API Group-Resources {"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/cluster.New
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160
sigs.k8s.io/controller-runtime/pkg/manager.New
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:313
main.main
/workspace/main.go:68
runtime.main
/usr/local/go/src/runtime/proc.go:255
1.6584932468965394e+09 ERROR setup unable to start manager {"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o timeout"}
main.main
/workspace/main.go:77
runtime.main
/usr/local/go/src/runtime/proc.go:255
@smithbk We would need to understand more about how the cluster is set up. This error should not be isolated to the GPU Operator; any pod on those nodes that tries to reach the API server should hit it (a quick way to verify that is sketched below). Can you work with Red Hat to understand this better?
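To show the issue is not specific to the GPU Operator, a minimal test pod pinned to the dedicated nodes can attempt the same API call. The manifest below is only a sketch: the pod name is hypothetical, and it reuses the label key, namespace, and service IP from this thread (it also assumes curl is present in the ubi8 image):
apiVersion: v1
kind: Pod
metadata:
  name: api-reachability-test    # hypothetical name, used only for this check
  namespace: nvidia-gpu-operator
spec:
  restartPolicy: Never
  nodeSelector:
    enterprise.discover.com/dedicated: 'true'
  containers:
    - name: curl
      image: registry.access.redhat.com/ubi8/ubi
      # any HTTP response (even 403) means the API server is reachable; a timeout reproduces the operator's error
      command: ["curl", "-k", "-sS", "-m", "10", "https://172.23.0.1:443/api"]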