eks-anywhere
EKS-A upgrade triggers move before workload cluster's RollingUpgrade is complete and fails
What happened:
While upgrading from an older EKS-A release on a management cluster with a single workload cluster, there is a step where the new EKS-A controller Deployment is applied to the cluster being upgraded, and a rolling upgrade starts on the workload cluster. The new eks-a controller comes up (unpaused) and applies the new CAPI/CAPC cluster spec, which has a modified value for EtcdadmCluster: the new resource removes the field `etcdadmBuiltin: true` and adds a new field, `etcdadmInstallCommands`.
As a result, the upgrade fails either on the move to the bootstrap cluster, or it times out waiting for ControlPlaneReady on the workload cluster after the move because the move happened in the middle of the workload cluster's upgrade.
What you expected to happen: I think a RollingUpgrade is expected in this case, so my proposal is that we wait for all the clusters associated with a given management cluster to finish upgrading before proceeding with the `clusterctl move`.
How to reproduce it (as minimally and precisely as possible): Create a CloudStack management cluster with 3 etcd, 2 CP, and 3 worker nodes using an old eks-a release (v0.8.3-dev). Then create a workload cluster with the same configuration using the management cluster. Proceed to upgrade the management cluster using the v0.10.0 release of EKS-A.
Anything else we need to know?:
Environment:
- EKS Anywhere Release: v0.10.0
- EKS Distro Release:
Discussing with @vivek-koppuru, he suggested performing a check prior to executing the move: call `kubectl get clusters.cluster.x-k8s.io`, then look at the Status field of each result to make sure it is in the Ready state before executing `clusterctl move`. For management clusters, loop over all the results. The open question, however, is how to assess the Ready state of a cluster. `clusterctl describe cluster` derives it intelligently somehow, and it would be great to be able to reuse the same value.
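A minimal sketch of that pre-move check, assuming we settle for the `Ready` condition in each Cluster's `status.conditions` (the same condition `clusterctl describe` surfaces). The function name and the inline sample data are illustrative, not part of the EKS-A codebase; real input would come from `kubectl get clusters.cluster.x-k8s.io -A -o json`:

```python
import json

def clusters_ready(clusters_json: str) -> bool:
    """Return True only if every CAPI Cluster reports Ready=True.

    Expects the JSON produced by
    `kubectl get clusters.cluster.x-k8s.io -A -o json`.
    """
    items = json.loads(clusters_json)["items"]
    for cluster in items:
        conditions = cluster.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c["type"] == "Ready"), None)
        # A missing or non-True Ready condition means the cluster is still
        # upgrading (or unhealthy), so the move must wait.
        if ready is None or ready["status"] != "True":
            return False
    return bool(items)  # an empty list is treated as "not ready"

# Illustrative sample data, not real cluster output:
sample = json.dumps({"items": [
    {"metadata": {"name": "mgmt"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "workload-1"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]})
print(clusters_ready(sample))  # False: workload-1 is still upgrading
```

The upgrade flow would poll this until it returns True (with a timeout) before invoking `clusterctl move`, rather than moving while a workload cluster's rolling upgrade is still in flight.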
One major concern with this is that we would be conducting the management cluster's etcd and control plane machine upgrades on the cluster itself, instead of in a bootstrap cluster (which is the principle we follow when upgrading a management cluster).