eks-anywhere
eks-anywhere copied to clipboard
eksctl anywhere upgrade cluster doesn't use the extra hardware from hardware.csv
What happened: I created an EKS anywhere cluster with 1 CP node, kubernetes 1.26
Then I try to upgrade it to 1.27, with 2 new CP nodes plus the existing CP node in the new hardware file.
cluster config file change is:
diff eksa-mgmt02-md-cluster.yaml eksa-mgmt02-md-cluster-sc.yaml
25c25
< kubernetesVersion: "1.26"
---
> kubernetesVersion: "1.27"
36c36
< osImageURL: "https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.26.7.gz"
---
> osImageURL: "https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.27.11.gz"
70c70
< IMG_URL: https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.26.7.gz
---
> IMG_URL: https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.27.11.gz
hardware change is:
diff hardware-mgmt02.csv hardware-mgmt02-cp3.csv
2a3,4
> eksa-control-02,...,type=cp,/dev/sda
> eksa-control-03,...,type=cp,/dev/sda
The command I use:
eksctl anywhere upgrade cluster -f eksa-mgmt02-md-cluster-sc.yaml --hardware-csv hardware-mgmt02-cp3.csv --no-timeouts -v 9
I see upgrade stuck with following errors:
...
2024-04-15T07:16:38.920Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1713165386699246825 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-15T07:16:39.030Z V9 Cluster generation and observedGeneration {"Generation": 2, "ObservedGeneration": 2}
2024-04-15T07:16:39.031Z V5 Error happened during retry {"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 1}
2024-04-15T07:16:39.031Z V5 Sleeping before next retry {"time": "1s"}
...
2024-04-15T07:32:06.179Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1713165386699246825 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-15T07:32:06.285Z V9 Cluster generation and observedGeneration {"Generation": 2, "ObservedGeneration": 2}
2024-04-15T07:32:06.285Z V5 Error happened during retry {"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 832}
2024-04-15T07:32:06.286Z V5 Sleeping before next retry {"time": "1s"}
...
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- EKS Anywhere Release: v0.18.7
- EKS Distro Release: 1.26/1.27
From the initial look what I understand is this was a rolling and scaling upgrade request. This means that 2 extra hardwares were expected for the upgrade - 1 for the scale up, 1 spare for the rolling upgrade of 1.26 -> 1.27. My guess is that the extra hardware you added to the csv was used for the new CP node and eks anywhere didn't have enough for the rolling upgrade.
@mitalipaygude I just try to use 1 more machine (1 existing host + 2 idle hosts), same issue shows:
2024-04-14T08:59:52.814Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1713084293614893292 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-14T08:59:52.934Z V9 Cluster generation and observedGeneration {"Generation": 3, "ObservedGeneration": 3}
2024-04-14T08:59:52.934Z V5 Error happened during retry {"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 794}
2024-04-14T08:59:52.934Z V5 Sleeping before next retry {"time": "1s"}
At the same time, according to https://anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/:
EKS Anywhere upgrades on Bare Metal require at least one spare hardware server for control plane upgrade and one for each worker node group upgrade.
So 1 spare machine is good enough.