
Cluster stateful set stuck when failed scheduling occurs

Open mbrancato opened this issue 4 years ago • 5 comments

If an existing cluster's resource requests are scaled up to a level that results in failed scheduling, that cluster node will be stuck in a failed state, even if the resource requests are subsequently reduced.

The flow looks like:

  1. Scale up cluster nodes
  2. Last node in stateful set will be updated
  3. Node fails to schedule
  4. Reduce the resource requests for the CrdbCluster
  5. Stateful set is not updated to reflect new resource requests
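For illustration, a minimal sketch of the kind of edit involved at steps 1 and 4, assuming a hypothetical CrdbCluster named "cockroachdb" (the name and the request values are placeholders, not taken from the actual cluster):

  # Hypothetical CrdbCluster fragment; name and values are placeholders.
  apiVersion: crdb.cockroachlabs.com/v1alpha1
  kind: CrdbCluster
  metadata:
    name: cockroachdb
  spec:
    nodes: 3
    resources:
      requests:
        cpu: "8"      # step 1: scaled up to something the k8s nodes cannot satisfy
        memory: 32Gi
      # step 4: requests later reduced again (e.g. cpu: "4", memory: 16Gi),
      # but the StatefulSet keeps the old, larger requests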

The workaround is for the cluster admin (or whoever has access) to manually edit the stateful set and assign the updated requests, as sketched below.
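As a rough sketch of that manual fix (the container name "db" and the values are assumptions, not taken from the actual cluster), the edit lands on the requests of the database container in the StatefulSet's pod template:

  # Hypothetical StatefulSet fragment edited by hand so the pending pod can schedule.
  spec:
    template:
      spec:
        containers:
          - name: db          # assumed container name
            resources:
              requests:
                cpu: "4"
                memory: 16Gi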

mbrancato avatar Oct 11 '21 16:10 mbrancato

Questions:

  • Last node in stateful set will be updated - how? Does the operator update it? By node do you mean pod?
  • Node fails to schedule - why? Can you do a describe?
  • Reduce the resource requests for the CrdbCluster - are you editing the crd?
  • Stateful set is not updated to reflect new resource requests - this I am guessing is a separate problem

To recap: you add a new k8s node and somehow the last pod of the sts is updated and then does not schedule.

If you edit the resource requests such as CPU or memory the sts is not updated.

chrislovecnm avatar Oct 11 '21 18:10 chrislovecnm

Questions Last node in stateful set will be updated - how? Does the operator update it? By node do you mean pod?

Yes - the Pod is also the CRDB node in the CRDB cluster.

Node fails to schedule - why? Can you do a describe

It failed to schedule because there was not enough CPU and memory available, e.g.:

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   63s (x2 over 63s)  default-scheduler   0/72 nodes are available: 44 Insufficient memory, 72 Insufficient cpu.
  Normal   NotTriggerScaleUp  61s                cluster-autoscaler  pod didn't trigger scale-up: 1 Insufficient cpu, 1 Insufficient memory, 4 max node group size reached

Reduce the resource requests for the CrdbCluster - are you editing the crd?

Yes

Stateful set is not updated to reflect new resource requests - this I am guessing is a separate problem

What else would update the stateful set if not the operator? It literally has the label: app.kubernetes.io/managed-by=cockroach-operator

Sorry, I glossed over some things. As I understand it all....

The cockroach-operator acts as a controller for the CrdbCluster CRD. There is a stateful set controller for StatefulSet resources that ultimately controls Pod spec changes.

When the stateful set controller rolls out changes, it updates the last pod in the stateful set (the highest ordinal) first and does not move on to the others until that pod is running and ready.

When updating the CrdbCluster, the controller is updating the underlying StatefulSet. If that results in failed scheduling due to resources, the admin may adjust the resources on the CrdbCluster. The cockroach-operator is not picking up that second change to the CrdbCluster and applying it to the underlying StatefulSet. This may be a logic loop that is waiting for the prior change to complete successfully.
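For context, a sketch of how that controller chain shows up on the generated StatefulSet; the ownerReferences entry is an assumption based on typical operator behavior, while the managed-by label is the one mentioned above:

  # Sketch of the generated StatefulSet metadata (ownerReferences assumed).
  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: cockroachdb        # placeholder name
    labels:
      app.kubernetes.io/managed-by: cockroach-operator
    ownerReferences:
      - apiVersion: crdb.cockroachlabs.com/v1alpha1
        kind: CrdbCluster
        name: cockroachdb    # placeholder name
        controller: true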

I gave it several cycles in the backoff retry to see if the operator would update the StatefulSet - it did not.

mbrancato avatar Oct 11 '21 20:10 mbrancato

Thanks for all the details!!!!

So the problem was that the pod did not schedule due to lack of resources on the node. When you updated the resource settings, the sts was not updated, or the pod was in turn not updated.

We will triage and see if we can recreate and fix this.

chrislovecnm avatar Oct 11 '21 21:10 chrislovecnm

@davidwding you mind taking a look?

chrislovecnm avatar Oct 11 '21 21:10 chrislovecnm

Thanks for looking into it. To be clear, I did the following:

  • Updated the resources in the CrdbCluster CRD
    • This caused the sts to update and pods to fail scheduling
  • Updated the resources in the CrdbCluster CRD again
    • This did not update the sts, pods continued to fail scheduling
  • Updated the resources in the sts directly
    • pods successfully scheduled

mbrancato avatar Oct 11 '21 23:10 mbrancato