cass-operator
K8SSAND-1737 ⁃ StatefulSet restart can cause a bug in the startNode process.
What happened? I manually edited the StatefulSet annotations to cause a rolling restart (the same way kubectl rollout restart would). The first pod restarted correctly and got back to 2/2 status. The next one did not, and cass-operator started writing the following to its logs:
1.6606510016415665e+09 INFO controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller reconcile_racks::startAllNodes {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "cass-operator", "namespace": "cass-operator", "datacenterName": "dc1", "clusterName": "cluster2"}
1.6606510016416464e+09 ERROR controllers.CassandraDatacenter calculateReconciliationActions returned an error {"cassandradatacenter": "cass-operator/dc1", "requestNamespace": "cass-operator", "requestName": "dc1", "loopID": "a1357b72-90c1-4156-b7da-a7d09e415ca2", "error": "checks failed desired:3, ready:2, started:3"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.6606510016416826e+09 INFO controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller checks failed desired:3, ready:2, started:3 {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "cass-operator", "reason": "ReconcileFailed", "eventType": "Warning"}
1.6606510016417089e+09 INFO controllers.CassandraDatacenter Reconcile loop completed {"cassandradatacenter": "cass-operator/dc1", "requestNamespace": "cass-operator", "requestName": "dc1", "loopID": "a1357b72-90c1-4156-b7da-a7d09e415ca2", "duration": 0.021397826}
1.6606510016417375e+09 ERROR controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller Reconciler error {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc1", "namespace": "cass-operator", "error": "checks failed desired:3, ready:2, started:3"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.6606510016417637e+09 DEBUG events Warning {"object": {"kind":"CassandraDatacenter","namespace":"cass-operator","name":"dc1","uid":"549e5ab9-072f-48f8-934a-799efe705c8a","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"12355"}, "reason": "ReconcileFailed", "message": "checks failed desired:3, ready:2, started:3"}
The operator seems unable to recover from this state.
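For reference, the annotation edit described above can also be done programmatically. This is a minimal client-go sketch, not code from the issue: the StatefulSet name cluster2-dc1-default-sts is a guess based on the cluster and datacenter names in the logs, and it assumes a kubeconfig at the default location. It does what kubectl rollout restart does: stamp a restartedAt annotation onto the pod template so the StatefulSet controller rolls the pods one at a time.

```go
// Sketch only: trigger the same rolling restart with client-go instead of
// editing the StatefulSet by hand. Any change to the pod template causes the
// StatefulSet controller to restart the pods one by one.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	// The StatefulSet name is an assumption; adjust it to your cluster.
	sts, err := client.AppsV1().StatefulSets("cass-operator").
		Get(ctx, "cluster2-dc1-default-sts", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Same annotation kubectl rollout restart sets on the pod template.
	if sts.Spec.Template.Annotations == nil {
		sts.Spec.Template.Annotations = map[string]string{}
	}
	sts.Spec.Template.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339)

	if _, err := client.AppsV1().StatefulSets("cass-operator").
		Update(ctx, sts, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("rolling restart triggered")
}
```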
Did you expect to see something different?
How to reproduce it (as minimally and precisely as possible):
Environment

* Cass Operator version:
  Insert image tag or Git SHA here
* Kubernetes version information:
  kubectl version
* Kubernetes cluster kind:
  insert how you created your cluster: kops, bootkube, etc.
* Manifests:
  insert manifests relevant to the issue
* Cass Operator Logs:
  insert Cass Operator logs relevant to the issue here
Anything else we need to know?:
┆Issue is synchronized with this Jira Task by Unito ┆friendlyId: K8SSAND-1737 ┆priority: Medium
While it did eventually recover from that (by killing the stuck nodes), I didn't like that behavior. Sadly, I'm having trouble reproducing the issue so that I can debug it.
Note that #403 is going to change startAllNodes
significantly, so we might want to investigate if this is still happening after that change is merged.
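For context on the error itself, checks failed desired:3, ready:2, started:3 looks like a count-based consistency check in the startAllNodes path. The sketch below is a simplified illustration and not the actual cass-operator code: if a restarted pod keeps its "started" marker but never passes its readiness probe again, the ready count stays below both the desired size and the started count, so the reconcile loop keeps returning the same error.

```go
// Simplified illustration only; the real check lives in cass-operator's
// reconcile_racks code and may differ in detail.
package main

import "fmt"

// podState captures the two signals the error message compares.
type podState struct {
	Ready   bool // readiness probe has passed (pod shows 2/2)
	Started bool // pod is marked as a started Cassandra node
}

// checkNodeCounts mimics the failing invariant: every started node should
// also be ready, and both totals should reach the desired size, before the
// start sequence is allowed to continue.
func checkNodeCounts(desired int, pods []podState) error {
	ready, started := 0, 0
	for _, p := range pods {
		if p.Ready {
			ready++
		}
		if p.Started {
			started++
		}
	}
	if ready < desired || started < desired || ready < started {
		return fmt.Errorf("checks failed desired:%d, ready:%d, started:%d", desired, ready, started)
	}
	return nil
}

func main() {
	// The situation from the logs: one restarted pod kept its "started"
	// marker but never became ready again, so the check can never pass.
	pods := []podState{
		{Ready: true, Started: true},
		{Ready: true, Started: true},
		{Ready: false, Started: true}, // stuck pod
	}
	fmt.Println(checkNodeCounts(3, pods))
	// prints: checks failed desired:3, ready:2, started:3
}
```

Killing the stuck pod lets it rejoin and become ready again, which would be consistent with the recovery described in the earlier comment.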
Closing this since the start process was modified. Open a new one if new information appears.