control plane RollingUpdate config silently breaks cluster creation
What happened:
While following the documented process for Bare Metal RollingUpgrades, I prepared a new cluster.yaml that includes the upgradeRolloutStrategy set to RollingUpdate for both the control plane and worker node groups.
Initially, I encountered a validation error:
Error: the cluster config file provided is invalid: validating upgrade rollout strategy configuration: WorkerNodeGroupConfiguration: upgradeRolloutStrategy.rollingUpdate field is required for upgradeRolloutStrategy.type RollingUpdate
After addressing this by adding the required rollingUpdate fields for the worker group, the configuration passed validation. For the control plane, it appears maxSurge defaults to 1, and since the preflight checks succeed without explicitly setting it, I assumed that was acceptable.
Here is the diff between the original and updated cluster.yaml:
--- cluster.yaml.orig 2025-07-12 21:15:44.266427279 +0000
+++ cluster.yaml 2025-07-12 21:16:23.706199561 +0000
@@ -13,6 +13,8 @@
       cidrBlocks:
       - 10.96.0.0/12
   controlPlaneConfiguration:
+    upgradeRolloutStrategy:
+      type: RollingUpdate
     count: 3
     endpoint:
       host: "10.162.10.140"
@@ -31,6 +33,11 @@
       kind: TinkerbellMachineConfig
       name: eks-a
     name: worker-group
+    upgradeRolloutStrategy:
+      rollingUpdate:
+        maxSurge: 1
+        maxUnavailable: 0
+      type: RollingUpdate
 ---
 apiVersion: anywhere.eks.amazonaws.com/v1alpha1
 kind: TinkerbellDatacenterConfig
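For comparison, here is a sketch of the control plane block with the rollingUpdate field spelled out rather than left to its default. It assumes the control plane accepts the same rollingUpdate.maxSurge field that validation demands for worker groups, and it is based only on the follow-up at the end of this thread, where creation is reported to go through only when maxSurge is specified. Fields not relevant to the rollout strategy are omitted.

```yaml
# Sketch only: fragment of the Cluster spec in cluster.yaml.
# Assumption: the control plane takes rollingUpdate.maxSurge like worker groups do.
spec:
  controlPlaneConfiguration:
    count: 3
    endpoint:
      host: "10.162.10.140"
    upgradeRolloutStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1   # same value as the reported default, stated explicitly
```

If this variant creates the cluster while the one in the diff does not, the real problem is the missing rollingUpdate block, not the TinkerbellMachineConfig named in the error.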
When attempting to create the cluster using:
eksctl anywhere create cluster \
--hardware-csv hardware.csv \
-f cluster.yaml \
--no-timeouts
The process failed with an error that suggested a missing TinkerbellMachineConfig:
cluster has an error: Dependent cluster objects don't exist: TinkerbellMachineConfig.anywhere.eks.amazonaws.com "eks-a-cp" not found
However, this resource clearly exists in the bootstrap cluster:
$ kubectl get TinkerbellMachineConfig --all-namespaces
NAMESPACE   NAME       AGE
default     eks-a      54s
default     eks-a-cp   54s
Despite the resource being present, the cluster creation process keeps retrying endlessly. With --no-timeouts, that ironically guarantees being stuck in a loop with no helpful feedback.
What you expected to happen:
I expected the cluster creation to either work or fail with a clear and accurate error. If something is misconfigured, the message should explain what's actually wrong so it's possible to fix it without guesswork.
How to reproduce it (as minimally and precisely as possible):
- Create a cluster.yaml that includes upgradeRolloutStrategy.type: RollingUpdate for both control plane and worker groups. Ensure rollingUpdate values are included for the worker group but left out for the control plane (since it passes validation).
- Run eksctl anywhere create cluster using the configuration.
- Observe the error related to a missing TinkerbellMachineConfig, even though the resource is present, and note the endless retry loop behavior.
Anything else we need to know?:
Environment:
- EKS Anywhere Release
$ eksctl anywhere version
Version: v0.22.6
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/100/manifest.yaml
@swapzero For a rolling upgrade on bare metal, enough hardware must be available; from the details you have provided, I can't tell whether you have added the servers required to support a rolling upgrade. Regarding --no-timeouts, it is meant for specific DC environments where certain configurations can take much longer than anticipated. It's up to the user whether they want no timeouts in their provisioning cycle; you don't always have to use it.
@ndeksa this is cluster creation, there is nothing there except for the hardware needed for creation. The problem clearly seems to be related to something else.
@swapzero The default upgradeRolloutStrategy type is RollingUpdate, as specified in the docs here, and the default values for the corresponding maxSurge and maxUnavailable fields are 1 and 0 respectively. I don't see a cluster creation scenario where these values need to be configured explicitly in the cluster spec. Can you explain what the use case is here?
@sp1999 sure, it's straightforward. That is the configuration I am going to use from now on for this cluster :).
And I cannot create the cluster with that configuration. That's the problem.
I mean, creating an EKS-A cluster shouldn't fail just because an option is explicitly set to its default value.
Later edit:
Even if RollingUpdate is the default strategy, it wouldn't hurt to explicitly set it in the configuration for more clarity.
Except that you can't create the cluster if you don't specify maxSurge. Since maxSurge also has a default, why does it fail? :)