control plane RollingUpdate config silently breaks cluster creation

Open swapzero opened this issue 5 months ago • 4 comments

What happened:

While following the documented process for Bare Metal RollingUpgrades, I prepared a new cluster.yaml that includes the upgradeRolloutStrategy set to RollingUpdate for both the control plane and worker node groups.

Initially, I encountered a validation error:

Error: the cluster config file provided is invalid: validating upgrade rollout strategy configuration: WorkerNodeGroupConfiguration: upgradeRolloutStrategy.rollingUpdate field is required for upgradeRolloutStrategy.type RollingUpdate

After addressing this by adding the required rollingUpdate fields for the worker group, the configuration passed validation. For the control plane, it appears maxSurge defaults to 1, and since the preflight checks succeed without explicitly setting it, I assumed that was acceptable.

Here is the diff between the original and updated cluster.yaml:

--- cluster.yaml.orig   2025-07-12 21:15:44.266427279 +0000
+++ cluster.yaml        2025-07-12 21:16:23.706199561 +0000
@@ -13,6 +13,8 @@
       cidrBlocks:
       - 10.96.0.0/12
   controlPlaneConfiguration:
+    upgradeRolloutStrategy:
+      type: RollingUpdate
     count: 3
     endpoint:
       host: "10.162.10.140"
@@ -31,6 +33,11 @@
       kind: TinkerbellMachineConfig
       name: eks-a
     name: worker-group
+    upgradeRolloutStrategy:
+      rollingUpdate:
+        maxSurge: 1
+        maxUnavailable: 0
+      type: RollingUpdate
 ---
 apiVersion: anywhere.eks.amazonaws.com/v1alpha1
 kind: TinkerbellDatacenterConfig

When attempting to create the cluster using:

eksctl anywhere create cluster \
  --hardware-csv hardware.csv \
  -f cluster.yaml \
  --no-timeouts

The process failed with an error that suggested a missing TinkerbellMachineConfig:

cluster has an error: Dependent cluster objects don't exist: TinkerbellMachineConfig.anywhere.eks.amazonaws.com "eks-a-cp" not found

However, this resource clearly exists in the bootstrap cluster:

$ kubectl get TinkerbellMachineConfig --all-namespaces
NAMESPACE   NAME       AGE
default     eks-a      54s
default     eks-a-cp   54s

Despite the resource being present, the cluster creation process keeps retrying endlessly. With --no-timeouts, this ironically guarantees being stuck in a loop with no helpful feedback.


What you expected to happen:

I expected the cluster creation to either work or fail with a clear and accurate error. If something is misconfigured, the message should explain what's actually wrong so it's possible to fix it without guesswork.


How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster.yaml that includes upgradeRolloutStrategy.type: RollingUpdate for both control plane and worker groups. Ensure rollingUpdate values are included for the worker group but left out for the control plane (since it passes validation).
  2. Run eksctl anywhere create cluster using the configuration.
  3. Observe the error related to a missing TinkerbellMachineConfig, even though the resource is present, and note the endless retry loop behavior.

Anything else we need to know?:


Environment:

  • EKS Anywhere Release
$ eksctl anywhere version
Version: v0.22.6
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/100/manifest.yaml

swapzero, Jul 12 '25 22:07

@swapzero For a rolling upgrade on bare metal, enough hardware must be available. Based on the details you have provided, I can't tell whether you have added the servers required to support a rolling upgrade. Regarding --no-timeouts: it is meant for specific data center environments where certain configuration steps can take much longer than anticipated. It is up to the user whether they want no timeouts in their provisioning cycle; you don't always have to use it.

ndeksa, Jul 16 '25 23:07

@ndeksa This is cluster creation; there is nothing there except the hardware needed for creation. The problem clearly seems to be related to something else.

swapzero, Jul 17 '25 06:07

@swapzero The default upgradeRolloutStrategy type is RollingUpdate, as specified in the docs here, and the default values for the corresponding maxSurge and maxUnavailable fields are 1 and 0 respectively. I don't see a cluster creation scenario where these values need to be configured explicitly in the cluster spec. Can you explain what the use case is here?

sp1999, Jul 19 '25 00:07

@sp1999 sure, it's straightforward. That is the configuration I am going to use from now on for this cluster :).

And I cannot create the cluster with that configuration. That's the problem.

I mean, creating an EKS-A cluster shouldn't fail just because an option is set to its default value.

Later edit: Even if RollingUpdate is the default strategy, it wouldn't hurt to set it explicitly in the configuration for clarity. Except that you can't create the cluster if you don't specify maxSurge. Since maxSurge also has a default, why does it fail? :)

swapzero, Jul 19 '25 10:07
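
For readers comparing configurations: the last comment implies that creation only goes through when maxSurge is spelled out for the control plane. A minimal sketch of what that fragment might look like follows, reusing the count and endpoint values from the diff above. It assumes the control plane accepts a rollingUpdate.maxSurge field in the same shape as the worker group; nothing in this thread confirms that this resolves the "eks-a-cp not found" error, so treat it as something to try rather than a verified fix.

```yaml
# Sketch only: controlPlaneConfiguration fragment from the diff above, with
# the rollingUpdate block written out instead of left to its default.
controlPlaneConfiguration:
  upgradeRolloutStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1   # assumed field; 1 is the default value cited in the thread
  count: 3
  endpoint:
    host: "10.162.10.140"
```

The worker group block from the diff (maxSurge: 1, maxUnavailable: 0) stays unchanged; only the control plane section differs from the configuration that failed.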