Keith Smith
@shivamerla There are no taints on the GPU nodes. For example, the following is the describe output for one of them: ``` Name: ip-10-111-47-77.ec2.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=p3.2xlarge beta.kubernetes.io/os=linux...
@shivamerla I was able to get placement to work by running `oc edit namespace nvidia-gpu-operator` and changing ``` openshift.io/node-selector: enterprise.discover.com/shared=true ``` to ``` openshift.io/node-selector: enterprise.discover.com/dedicated=true ``` in the metadata.annotations section. However, now...
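For reference, a rough sketch of how the relevant part of the namespace object might look after that edit. Only the namespace name and the annotation key and value come from the comment above; the surrounding fields are the standard Namespace layout and are assumed here, and a real object will carry other annotations and fields that are omitted.

```yaml
# Sketch only: fields other than the node-selector annotation are assumed.
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
  annotations:
    # Changed from enterprise.discover.com/shared=true
    openshift.io/node-selector: enterprise.discover.com/dedicated=true
```

The annotation acts as a project-wide node selector, so pods created in this namespace should only be scheduled onto nodes carrying the `enterprise.discover.com/dedicated=true` label.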
@shivamerla The ClusterPolicy status is not ready. Logs of the gpu-operator pod show the following: ``` 1.6583371149379141e+09 ERROR Failed to get API Group-Resources {"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443:...
@shivamerla Is it possible that the gpu-operator is trying to access the shared pod IP instead of the dedicated pod IP? Just guessing.
@shivamerla What I pasted above was the complete gpu-operator log, but here it is again: ``` 1.658493246896439e+09 ERROR Failed to get API Group-Resources {"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o...
@kpouget Kevin, do you know who might be able to help with this? Thanks
@kpouget @shivamerla Here is the pod description ``` $ oc describe pod gpu-operator-566644fc46-2znxj Name: gpu-operator-566644fc46-2znxj Namespace: openshift-operators Priority: 2000001000 Priority Class Name: system-node-critical Node: ip-10-111-61-177.ec2.internal/10.111.61.177 Start Time: Tue, 05 Apr...
@kpouget Looks OK to me. If there is some other way of checking, let me know. ``` $ oc describe node ip-10-111-61-177.ec2.internal Name: ip-10-111-61-177.ec2.internal Roles: infra,worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m5a.2xlarge beta.kubernetes.io/os=linux...
@kpouget Any other ideas of what to check, or someone else who would know? Thanks
@kpouget The pod is running now but the cluster policy status is not progressing. Here is what I'm seeing now. ``` $ oc get pod -n openshift-operators | grep gpu-operator...