K8SSAND-1348 ⁃ Cluster can get into a deadlocked state, unable to recover without manual intervention
What happened? Please see #302 for the full story. This is the "other issue I'll report separately" from step 4. We’re using cass-operator 1.5.1; I do not yet know if this is still a problem with the latest version.
Did you expect to see something different? I expected cass-operator to figure things out and bring our cluster up.
How to reproduce it (as minimally and precisely as possible):
This one is tricky. First you need to get a cluster with a Cassandra pod in a bad state. Not allocating enough memory should do the trick. The bad node needs to be stuck in Starting.
Now, try to move one of the other pods to another node. It will get stuck in Ready to Start, unable to proceed because cass-operator is waiting on the other node.
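For anyone trying to reproduce this, here is a minimal sketch of the kind of setup I mean. This is not our actual spec; the names, versions, and sizes are illustrative. The idea is just to give the Cassandra container less memory than it needs to finish bootstrapping, so one pod wedges in Starting, and then to force a healthy pod to reschedule.

```yaml
# Illustrative CassandraDatacenter with deliberately undersized memory.
# Field values (storage class, versions, sizes) are placeholders, not our real spec.
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: "3.11.7"
  managementApiAuth:
    insecure: {}
  size: 3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
  # Step 1: give the container less memory than Cassandra needs to finish
  # bootstrapping, so one pod ends up stuck in "Starting".
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 1Gi
  config:
    jvm-options:
      initial_heap_size: "800M"
      max_heap_size: "800M"
# Step 2, once a pod is stuck in "Starting": force another (healthy) pod to move,
# e.g. by cordoning its node and deleting the pod:
#   kubectl cordon <node-hosting-a-healthy-cassandra-pod>
#   kubectl delete pod <that-cassandra-pod>
# The rescheduled pod then sits in "Ready to Start" because cass-operator is
# still waiting on the pod stuck in "Starting".
```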
I think that if there are other pods not yet in Started, cass-operator should probably give up on the "Starting" pod after some period of time and see whether it can start one of the other ones first. Maybe there's a more sophisticated option here; I don't know what it would be.
@jakerobb can you please provide the version of cass-operator?
➤ Jake Robb commented:
It’s 1.5.1. I added that to the description.
The problem is that we don't know for sure that the "Starting" pod has failed, and as long as that's the case we shouldn't start another node. Giving up on "Starting" would mean killing the pod after some timeout, and that timeout would have to be at least the maximum time a pod can legitimately take to start correctly. In some instances that can be quite a while, and if we keep killing a pod that is still making progress, the cluster would never get its nodes started.
So unless we can reliably say that the pod cannot be started (or has failed to start) in its current configuration, we might not want to do anything. Otherwise we risk a state where the cluster never boots up and there isn't even a chance of manual recovery. The ticket in #302 refers to the step-4 pod as "Ready-to-Start" while this one talks about "Starting"; I'm assuming this one names the correct state.
The #302 story is a bit odd, since it involved manually messing with resources (overriding cass-operator behavior), and there are probably lots of scenarios like that which we can't automatically recover from, because we don't have the information about what happened.
➤ Jake Robb commented:
Regarding the Ready-to-Start and Starting states, I just went back and re-read. I’m confident I described it correctly. Note that there are two instances in play (plus a third not mentioned).
I understand the point about not knowing whether a Starting pod is making progress or is stuck. This would involve changes in the underlying database, but maybe there could be some kind of pre-gossip mechanism where a starting database would make its progress known, and we could reliably interpret silence on that channel as a signal of stuck-ness.
Or maybe the operator could look for telltale signs in the log files? We knew that the Starting pod was never going to succeed based on an infinite loop that was apparent from the logs. That wouldn't help in every scenario where a pod in Starting is stuck, but it would help in a subset of cases, and the capability could be expanded upon as more signals of stuck-ness are discovered.
I fully recognize the complexity of what I’m asking for, but the fact that Cassandra is complex to operate is the whole reason cass-operator exists, so I feel like this is at least worth a conversation. 🙂
We could in theory follow the logs. Do you happen to have logs that clearly show what happened? Of course it would be even better if management-api itself knew that things had gone wrong, but logs / evidence / information at least lets us make some progress.
Closing due to lack of information.