Cluster Policy fails to start on OpenShift 4.9
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node? (No, RHEL 7.9)
- [x] Are you running Kubernetes v1.13+?
- [x] Are you running Docker (>= 18.06) or CRI-O (>= 1.13)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [ ] Did you apply the CRD? (`kubectl describe clusterpolicies --all-namespaces`)
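The kernel-module check above can be done in one pass; a minimal sketch, assuming `grep` with extended regexes (on a live node you would feed it real `lsmod` output, e.g. via `oc debug node/<name> -- chroot /host lsmod`; the sample below is illustrative only):

```shell
# Sample lsmod output for illustration only (hypothetical module sizes).
cat <<'EOF' > /tmp/lsmod.txt
i2c_core               81920  3 i2c_i801
ipmi_msghandler       114688  2 ipmi_devintf,ipmi_si
EOF
# Count how many of the two required modules are present; expect 2.
grep -E -c '^(i2c_core|ipmi_msghandler) ' /tmp/lsmod.txt
```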
2. Issue or feature description
I am trying to install the NVIDIA GPU operator on an OpenShift 4.9 cluster on IBM Cloud. It is a single-node cluster; the node has 2x P100 cards and ample CPU/RAM/storage.
- I was able to install the operator from OperatorHub smoothly (I have tried versions 1.9, 1.10, and 1.11).
- When I try to create the ClusterPolicy, it gets created, but its status never becomes `Ready`.
3. Steps to reproduce the issue
Not sure
4. Information to attach (optional if deemed irrelevant)
I tried some of the troubleshooting methods in the NVIDIA docs for this.
- After creating the ClusterPolicy, in other clusters I would immediately see a lot of pods created in the `Init` state. Here I only see the operator pod in the `nvidia-gpu-operator` namespace.
- Looking at the operator logs with `oc logs -f -n nvidia-gpu-operator -lapp=gpu-operator`, I see a consistent error:
```
I0701 11:31:15.012773 1 request.go:665] Waited for 1.000296617s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
1.6566750761031365e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.656675076103603e+09 INFO setup starting manager
1.6566750761038709e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.656675076103909e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0701 11:31:16.103963 1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-operator/53822513.nvidia.com...
I0701 11:31:31.435254 1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator/53822513.nvidia.com
1.6566750914353292e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"31a99e69-e0f7-42be-99d8-f2b130d5355c","apiVersion":"v1","resourceVersion":"2067460"}, "reason": "LeaderElection", "message": "gpu-operator-776dbc5f44-4fttb_ea480d69-70b5-4528-a7bb-c8264500c94a became leader"}
1.6566750914354377e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"f2f831e0-0afa-47ea-b981-aeff213040be","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2067461"}, "reason": "LeaderElection", "message": "gpu-operator-776dbc5f44-4fttb_ea480d69-70b5-4528-a7bb-c8264500c94a became leader"}
1.6566750914354844e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.ClusterPolicy"}
1.65667509143553e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.Node"}
1.6566750914355376e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.DaemonSet"}
1.6566750914355426e+09 INFO controller.clusterpolicy-controller Starting Controller
1.6566750915367239e+09 INFO controllers.ClusterPolicy Reconciliate ClusterPolicies after node label update {"nb": 0}
1.6566750915367982e+09 INFO controller.clusterpolicy-controller Starting workers {"worker count": 1}
1.6566751728138194e+09 ERROR controllers.ClusterPolicy Failed to initialize ClusterPolicy controller {"error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751728139005e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "gpu-cluster-policy", "namespace": "", "error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751729343596e+09 ERROR controllers.ClusterPolicy Failed to initialize ClusterPolicy controller {"error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751729344525e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "gpu-cluster-policy", "namespace": "", "error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
```
I am not able to pinpoint why this is stuck on this particular error.
- I have looked at NFD. I guess it is not able to get the cluster version from the node labels, but those labels are the same as on another cluster where everything works.
- Googling the error turned up the ClusterVersion CRD. I checked it, and it is defined properly in the cluster.
@manishdash12 Can you get the output of `oc get clusterversions -o yaml`? We are looking for the last successfully updated version (i.e. `state: Completed`).
```yaml
history:
- completionTime: "2022-03-15T14:44:39Z"
  image: quay.io/openshift-release-dev/ocp-release@sha256:6a899c54dda6b844bb12a247e324a0f6cde367e880b73ba110c056df6d018032
  startedTime: "2022-03-15T14:19:46Z"
  state: Completed
  verified: false
  version: 4.9.24
observedGeneration: 2
versionHash: J4j8PKeiaRA=
```
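For reference, the check the operator performs can be reproduced locally; a minimal sketch, assuming `jq` is installed (on a live cluster you would pipe `oc get clusterversion version -o json` into the same filter; the sample file below is trimmed from the output above):

```shell
# Sample ClusterVersion status trimmed from the output above.
cat <<'EOF' > /tmp/cv.json
{"status": {"history": [
  {"state": "Completed", "version": "4.9.24",
   "completionTime": "2022-03-15T14:44:39Z"}
]}}
EOF
# Pick the version of the most recent Completed history entry; expect 4.9.24.
jq -r '[.status.history[] | select(.state == "Completed")][0].version' /tmp/cv.json
```

If this filter prints `null`, no `Completed` entry exists, which is the condition that produces the "Failed to find Completed Cluster Version" error above.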
@manishdash12 any update on this?
Hi @shivamerla,
I have the same issue; it looks like below.
And I think the installation of the GPU operator from OperatorHub should NOT depend on a `Completed` cluster version of OCP. Once a ClusterVersion exists, whether its state is `Completed` or `Partial`, the cluster should be treated as OCP.
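The relaxed detection proposed here could be sketched as taking the newest history entry regardless of state (a hypothetical filter, not the operator's actual logic; ClusterVersion keeps its history newest-first):

```shell
# Hypothetical sample: an upgrade in progress on top of a completed install.
cat <<'EOF' > /tmp/cv-partial.json
{"status": {"history": [
  {"state": "Partial",   "version": "4.9.25"},
  {"state": "Completed", "version": "4.9.24"}
]}}
EOF
# Treat the cluster as OCP as soon as any history entry exists; expect 4.9.25.
jq -r '.status.history[0].version' /tmp/cv-partial.json
```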