Proper procedure for recovering an EKS-A cluster from a broken state

SunghoHong-gif opened this issue on Jul 30, 2025 · 1 comment

I have a question about recovering EKS-A provisioned clusters from a broken state.

Suppose a cluster has failed machines in both the control plane and the worker nodes, and these machines are unrecoverable (physically broken, so new bare metal machines have to be added as replacements). How should this be handled when we want to run a cluster upgrade?

example@example-admin:~$ kubectl get nodes
NAME                STATUS                        ROLES           AGE    VERSION
example-cp3-26       Ready                         control-plane   191d   v1.28.15
example-cp3-27       NotReady,SchedulingDisabled   control-plane   191d   v1.29.13
example-cp5-26       Ready                         control-plane   191d   v1.28.15
example-gpu-wk3-9    NotReady,SchedulingDisabled   <none>          191d   v1.29.13
example-gpu-wk5-11   Ready                         <none>          191d   v1.28.15
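
For reference, these are the commands I am using to inspect the underlying objects before attempting anything; the specific object names are illustrative, and my understanding is that EKS-A keeps the CAPI and Tinkerbell resources in the eksa-system namespace:

example@example-admin:~$ kubectl get machines -n eksa-system            # CAPI machine status backing each node
example@example-admin:~$ kubectl get hardware -n eksa-system            # Tinkerbell hardware entries for the bare metal machines
example@example-admin:~$ kubectl describe machine <failed-machine-name> -n eksa-system   # details on one failed machine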

Are we expected to manually add healthy control plane and worker machines to proceed with the cluster upgrade? Or are we expected to re-provision the cluster from scratch and restore from backups?
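
If manually adding replacement machines is the intended path, my assumption (based on my reading of the Bare Metal upgrade docs, so the file names and flag usage below may be off) is that it would look roughly like registering the new machines in a hardware CSV and passing it to the upgrade:

example@example-admin:~$ eksctl anywhere upgrade cluster \
    -f example-cluster.yaml \
    --hardware-csv new-hardware.csv      # CSV describing the replacement bare metal machines

It is not clear to me whether this works when existing control plane machines are already NotReady.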

I’m trying to understand the intended recovery path when the cluster is in an unstable state and cannot be restored using the originally provisioned machines.
