eks-anywhere icon indicating copy to clipboard operation
eks-anywhere copied to clipboard

eksctl anywhere upgrade cluster doesn't use the extra hardware from hardware.csv

Open ygao-armada opened this issue 11 months ago • 2 comments

What happened: I created an EKS anywhere cluster with 1 CP node, kubernetes 1.26

Then I try to upgrade it to 1.27, with 2 new CP nodes plus the existing CP node in the new hardware file.

cluster config file change is:

diff eksa-mgmt02-md-cluster.yaml eksa-mgmt02-md-cluster-sc.yaml
25c25
<   kubernetesVersion: "1.26"
---
>   kubernetesVersion: "1.27"
36c36
<   osImageURL: "https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.26.7.gz"
---
>   osImageURL: "https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.27.11.gz"
70c70
<           IMG_URL: https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.26.7.gz
---
>           IMG_URL: https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.27.11.gz

hardware change is:

diff hardware-mgmt02.csv hardware-mgmt02-cp3.csv 
2a3,4
> eksa-control-02,...,type=cp,/dev/sda
> eksa-control-03,...,type=cp,/dev/sda

The command I use: eksctl anywhere upgrade cluster -f eksa-mgmt02-md-cluster-sc.yaml --hardware-csv hardware-mgmt02-cp3.csv --no-timeouts -v 9

I see upgrade stuck with following errors:

...
2024-04-15T07:16:38.920Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1713165386699246825 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-15T07:16:39.030Z	V9	Cluster generation and observedGeneration	{"Generation": 2, "ObservedGeneration": 2}
2024-04-15T07:16:39.031Z	V5	Error happened during retry	{"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 1}
2024-04-15T07:16:39.031Z	V5	Sleeping before next retry	{"time": "1s"}
...
2024-04-15T07:32:06.179Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1713165386699246825 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-15T07:32:06.285Z	V9	Cluster generation and observedGeneration	{"Generation": 2, "ObservedGeneration": 2}
2024-04-15T07:32:06.285Z	V5	Error happened during retry	{"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 832}
2024-04-15T07:32:06.286Z	V5	Sleeping before next retry	{"time": "1s"}
...

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.18.7
  • EKS Distro Release: 1.26/1.27

ygao-armada avatar Mar 10 '24 22:03 ygao-armada

From the initial look what I understand is this was a rolling and scaling upgrade request. This means that 2 extra hardwares were expected for the upgrade - 1 for the scale up, 1 spare for the rolling upgrade of 1.26 -> 1.27. My guess is that the extra hardware you added to the csv was used for the new CP node and eks anywhere didn't have enough for the rolling upgrade.

mitalipaygude avatar Apr 11 '24 23:04 mitalipaygude

@mitalipaygude I just try to use 1 more machine (1 existing host + 2 idle hosts), same issue shows:

2024-04-14T08:59:52.814Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1713084293614893292 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-14T08:59:52.934Z	V9	Cluster generation and observedGeneration	{"Generation": 3, "ObservedGeneration": 3}
2024-04-14T08:59:52.934Z	V5	Error happened during retry	{"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 794}
2024-04-14T08:59:52.934Z	V5	Sleeping before next retry	{"time": "1s"}

At the same time, according to https://anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/:

EKS Anywhere upgrades on Bare Metal require at least one spare hardware server for control plane upgrade and one for each worker node group upgrade.

So 1 spare machine is good enough.

ygao-armada avatar Apr 14 '24 09:04 ygao-armada