Keith Smith comments

Results: 29 comments of Keith Smith

Nvidia GPU operator failing to install on OpenShift with dedicated rather than shared nodes

@shivamerla There are no taints on the GPU nodes. For example, the following is the describe output for one of them: ``` Name: ip-10-111-47-77.ec2.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=p3.2xlarge beta.kubernetes.io/os=linux...

Nvidia GPU operator failing to install on OpenShift with dedicated rather than shared nodes

@shivamerla I was able to get placement to work by `oc edit namespace nvidia-gpu-operator` and changing ``` openshift.io/node-selector: enterprise.discover.com/shared=true ``` to ``` openshift.io/node-selector: enterprise.discover.com/dedicated=true ``` in the metadata.annotations section. However, now...

Nvidia GPU operator failing to install on OpenShift with dedicated rather than shared nodes

@shivamerla The ClusterPolicy status is not ready. ![image](https://user-images.githubusercontent.com/9592001/180043044-0bb24e89-b234-4657-8cdf-e82a3e865223.png) Logs of the gpu-operator pod show the following: ``` 1.6583371149379141e+09 ERROR Failed to get API Group-Resources {"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443:...

Nvidia GPU operator failing to install on OpenShift with dedicated rather than shared nodes

@shivamerla Is it possible that the gpu-operator is trying to access the shared pod IP instead of the dedicated pod IP? Just guessing.

Nvidia GPU operator failing to install on OpenShift with dedicated rather than shared nodes

@shivamerla What I pasted above was the complete gpu-operator log, but here it is again: ``` 1.658493246896439e+09 ERROR Failed to get API Group-Resources {"error": "Get \"https://172.23.0.1:443/api?timeout=32s\": dial tcp 172.23.0.1:443: i/o...

gpu-operator pod in CrashLoopBackOff

@kpouget Kevin, do you know who might be able to help with this? Thanks

gpu-operator pod in CrashLoopBackOff

@kpouget @shivamerla Here is the pod description ``` $ oc describe pod gpu-operator-566644fc46-2znxj Name: gpu-operator-566644fc46-2znxj Namespace: openshift-operators Priority: 2000001000 Priority Class Name: system-node-critical Node: ip-10-111-61-177.ec2.internal/10.111.61.177 Start Time: Tue, 05 Apr...

gpu-operator pod in CrashLoopBackOff

@kpouget Looks OK to me. If there is some other way of checking, let me know. ``` $ oc describe node ip-10-111-61-177.ec2.internal Name: ip-10-111-61-177.ec2.internal Roles: infra,worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m5a.2xlarge beta.kubernetes.io/os=linux...

gpu-operator pod in CrashLoopBackOff

@kpouget Any other ideas of what to check, or someone else who would know? Thanks

gpu-operator pod in CrashLoopBackOff

@kpouget The pod is running now but the cluster policy status is not progressing. Here is what I'm seeing now. ``` $ oc get pod -n openshift-operators | grep gpu-operator...