cass-operator

K8SSAND-1737 ⁃ StatefulSet restart can cause a bug in the startNode process

Open · burmanm opened this issue 2 years ago • 1 comment

What happened? I manually edited the StatefulSet annotations to trigger a rolling restart (the same way kubectl rollout restart would). The first pod was restarted correctly and got back to 2/2 status. The next one did not, and cass-operator started writing the following to its logs:

1.6606510016415665e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller        reconcile_racks::startAllNodes  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "cass-operator", "namespace": "cass-operator", "datacenterName": "dc1", "clusterName": "cluster2"}
1.6606510016416464e+09  ERROR   controllers.CassandraDatacenter calculateReconciliationActions returned an error        {"cassandradatacenter": "cass-operator/dc1", "requestNamespace": "cass-operator", "requestName": "dc1", "loopID": "a1357b72-90c1-4156-b7da-a7d09e415ca2", "error": "checks failed desired:3, ready:2, started:3"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.6606510016416826e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller        checks failed desired:3, ready:2, started:3   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "cass-operator", "reason": "ReconcileFailed", "eventType": "Warning"}
1.6606510016417089e+09  INFO    controllers.CassandraDatacenter Reconcile loop completed        {"cassandradatacenter": "cass-operator/dc1", "requestNamespace": "cass-operator", "requestName": "dc1", "loopID": "a1357b72-90c1-4156-b7da-a7d09e415ca2", "duration": 0.021397826}
1.6606510016417375e+09  ERROR   controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller        Reconciler error        {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "cass-operator", "error": "checks failed desired:3, ready:2, started:3"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.6606510016417637e+09  DEBUG   events  Warning {"object": {"kind":"CassandraDatacenter","namespace":"cass-operator","name":"dc1","uid":"549e5ab9-072f-48f8-934a-799efe705c8a","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"12355"}, "reason": "ReconcileFailed", "message": "checks failed desired:3, ready:2, started:3"}
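
For reference, the restart was triggered with a patch roughly equivalent to what kubectl rollout restart applies, i.e. bumping the kubectl.kubernetes.io/restartedAt annotation on the pod template. A minimal sketch (the StatefulSet name and timestamp below are illustrative, not taken from this cluster):

    kubectl -n cass-operator patch statefulset cluster2-dc1-default-sts --type merge \
      -p '{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"2022-08-16T12:00:00Z"}}}}}'

Any change to that annotation value rolls the pods one by one, which is what exposed the behaviour above.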

The operator seems unable to recover from this state on its own.
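
If this reoccurs, a quick way to see the mismatch the failed check refers to is to compare each pod's readiness against the node-state label cass-operator maintains on it (assuming the usual cassandra.datastax.com/node-state label), for example:

    kubectl -n cass-operator get pods -L cassandra.datastax.com/node-state

A pod that is no longer ready but still labelled Started would explain ready:2 versus started:3.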

Did you expect to see something different?

How to reproduce it (as minimally and precisely as possible):

Environment

  • Cass Operator version:
    Insert image tag or Git SHA here
  • Kubernetes version information:
    kubectl version
  • Kubernetes cluster kind:
    insert how you created your cluster: kops, bootkube, etc.
  • Manifests:
    insert manifests relevant to the issue
  • Cass Operator Logs:
    insert Cass Operator logs relevant to the issue here

Anything else we need to know?:

┆Issue is synchronized with this Jira Task by Unito ┆friendlyId: K8SSAND-1737 ┆priority: Medium

burmanm · Aug 16 '22 12:08

While it did eventually recover from that (once the stuck nodes were killed), I didn't like it. Sadly, I'm having trouble reproducing the issue so that I can debug it.

burmanm · Aug 17 '22 11:08

Note that #403 is going to change startAllNodes significantly, so we might want to investigate if this is still happening after that change is merged.

adutra · Sep 13 '22 08:09

Closing this since the start process was modified. Open a new issue if new information appears.

burmanm · Sep 20 '22 06:09