
Cluster Policy fails to start on Openshift 4.9

Open manishdash12 opened this issue 3 years ago • 3 comments

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node? - No, the node runs RHEL 7.9
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [ ] Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

I am trying to install the NVIDIA GPU operator on an OpenShift 4.9 cluster on IBM Cloud. It is a single-node cluster; the node has 2x P100 cards and ample CPU/RAM/storage.

  • I was able to install the operator from OperatorHub smoothly (I have tried versions 1.9, 1.10, and 1.11).
  • When I try to create a ClusterPolicy, it gets created but its status never becomes Ready.
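
One way to see the stuck status programmatically is to parse the ClusterPolicy object returned by `oc get clusterpolicy gpu-cluster-policy -o json`. This is a minimal sketch against a made-up sample; the `.status.state` field name is an assumption and may differ between gpu-operator versions:

```python
import json

# Sample shaped like `oc get clusterpolicy gpu-cluster-policy -o json`.
# The .status.state field name is an assumption and may differ between
# gpu-operator versions.
sample = json.loads("""
{
  "kind": "ClusterPolicy",
  "metadata": {"name": "gpu-cluster-policy"},
  "status": {"state": "notReady"}
}
""")

def clusterpolicy_ready(obj):
    """Return True if the ClusterPolicy reports a ready state."""
    return obj.get("status", {}).get("state", "").lower() == "ready"

print(clusterpolicy_ready(sample))  # -> False (the policy is stuck)
```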

2. Steps to reproduce the issue

Not sure

3. Information to attach (optional if deemed irrelevant)

I tried some of the troubleshooting methods from the NVIDIA docs on this.

  1. After creating the ClusterPolicy, on other clusters I would immediately see a lot of pods being created in the Init state. Here I only see the operator pod in the nvidia-gpu-operator namespace.

  2. I tried to see the operator logs using the command oc logs -f -n nvidia-gpu-operator -lapp=gpu-operator

    I see a consistent error here:

I0701 11:31:15.012773       1 request.go:665] Waited for 1.000296617s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
1.6566750761031365e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.656675076103603e+09   INFO    setup   starting manager
1.6566750761038709e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.656675076103909e+09   INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0701 11:31:16.103963       1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-operator/53822513.nvidia.com...
I0701 11:31:31.435254       1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator/53822513.nvidia.com
1.6566750914353292e+09  DEBUG   events  Normal  {"object": {"kind":"ConfigMap","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"31a99e69-e0f7-42be-99d8-f2b130d5355c","apiVersion":"v1","resourceVersion":"2067460"}, "reason": "LeaderElection", "message": "gpu-operator-776dbc5f44-4fttb_ea480d69-70b5-4528-a7bb-c8264500c94a became leader"}
1.6566750914354377e+09  DEBUG   events  Normal  {"object": {"kind":"Lease","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"f2f831e0-0afa-47ea-b981-aeff213040be","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2067461"}, "reason": "LeaderElection", "message": "gpu-operator-776dbc5f44-4fttb_ea480d69-70b5-4528-a7bb-c8264500c94a became leader"}
1.6566750914354844e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.ClusterPolicy"}
1.65667509143553e+09    INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.Node"}
1.6566750914355376e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.DaemonSet"}
1.6566750914355426e+09  INFO    controller.clusterpolicy-controller     Starting Controller
1.6566750915367239e+09  INFO    controllers.ClusterPolicy       Reconciliate ClusterPolicies after node label update    {"nb": 0}
1.6566750915367982e+09  INFO    controller.clusterpolicy-controller     Starting workers        {"worker count": 1}
1.6566751728138194e+09  ERROR   controllers.ClusterPolicy       Failed to initialize ClusterPolicy controller   {"error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751728139005e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "gpu-cluster-policy", "namespace": "", "error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751729343596e+09  ERROR   controllers.ClusterPolicy       Failed to initialize ClusterPolicy controller   {"error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751729344525e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "gpu-cluster-policy", "namespace": "", "error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

I am not able to pinpoint why this is stuck with this particular error.

  • I have looked at NFD (I suspect the operator is not able to get the cluster version from the node labels, but the labels are the same as on another cluster where everything works).
  • Googling the error surfaced the ClusterVersion CRD. I checked it, and it is defined properly in the cluster.
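
To compare the NFD labels between the broken and working clusters, the feature labels can be filtered out of `oc get node <name> -o json`. A rough sketch; the label values below are made up for illustration:

```python
import json

# Sample shaped like `oc get node <name> -o json`; the label values
# here are hypothetical.
node = json.loads("""
{
  "metadata": {
    "labels": {
      "feature.node.kubernetes.io/pci-10de.present": "true",
      "feature.node.kubernetes.io/kernel-version.full": "3.10.0-1160.el7.x86_64",
      "kubernetes.io/hostname": "worker-0"
    }
  }
}
""")

def nfd_labels(node):
    """Return only the labels written by node-feature-discovery."""
    labels = node["metadata"].get("labels", {})
    return {k: v for k, v in labels.items()
            if k.startswith("feature.node.kubernetes.io/")}

for key, value in sorted(nfd_labels(node).items()):
    print(f"{key}={value}")
```

Diffing this output across the two clusters would confirm or rule out the node-label theory.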

manishdash12 avatar Jul 01 '22 12:07 manishdash12

@manishdash12 Can you share the output of oc get clusterversions -o yaml? We look for the last successfully updated version (i.e. state: Completed), for example:

    history:
    - completionTime: "2022-03-15T14:44:39Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:6a899c54dda6b844bb12a247e324a0f6cde367e880b73ba110c056df6d018032
      startedTime: "2022-03-15T14:19:46Z"
      state: Completed
      verified: false
      version: 4.9.24
    observedGeneration: 2
    versionHash: J4j8PKeiaRA=
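
The check can be approximated like this: the operator scans the ClusterVersion history for an entry with state Completed, and initialization fails with the error above when none exists. A rough sketch of that kind of check (not the operator's actual code), using the sample history above:

```python
import json
from typing import Optional

# Status shaped like `.status` of `oc get clusterversion version -o json`,
# using the values from the snippet above.
status = json.loads("""
{
  "history": [
    {"completionTime": "2022-03-15T14:44:39Z",
     "startedTime": "2022-03-15T14:19:46Z",
     "state": "Completed",
     "verified": false,
     "version": "4.9.24"}
  ]
}
""")

def completed_version(status: dict) -> Optional[str]:
    """Return the first history entry with state Completed (history is
    ordered newest-first), or None when no update has completed."""
    for entry in status.get("history", []):
        if entry.get("state") == "Completed":
            return entry.get("version")
    return None  # the None case corresponds to the reported error

print(completed_version(status))  # -> 4.9.24
```

A cluster whose history contains only Partial entries would return None, which matches the "Failed to find Completed Cluster Version" symptom.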

shivamerla avatar Jul 01 '22 16:07 shivamerla

@manishdash12 any update on this?

shivamerla avatar Jul 11 '22 16:07 shivamerla

Hi @shivamerla, I have the same issue; it looks like the image below

And I think the installation of the GPU operator from OperatorHub should NOT depend on the cluster version of OCP being "Completed". As long as a ClusterVersion exists, whether its state is Completed or Partial, the cluster should be treated as OCP.
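
The relaxed check being proposed could be sketched like this (hypothetical logic, not what the operator currently does): treat the cluster as OCP whenever any ClusterVersion history entry exists, regardless of its state.

```python
def is_openshift(status: dict) -> bool:
    """Proposed behavior: treat the cluster as OCP if any ClusterVersion
    history entry exists, whether its state is Completed or Partial."""
    return len(status.get("history", [])) > 0

# A cluster mid-upgrade (only Partial entries) would still qualify:
print(is_openshift({"history": [{"state": "Partial", "version": "4.9.24"}]}))  # -> True
```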

MingZhang-YBPS avatar Jan 23 '24 12:01 MingZhang-YBPS