EKSA bare metal cluster scale-in doesn't honor new hardware.csv file
What happened: In an EKS-A bare metal cluster, I tried to scale the cluster in by 1 worker node and removed that specific worker node from the hardware.csv file, then ran the following command:
eksctl anywhere upgrade cluster -f eksa-new.yaml --hardware-csv hardware-new.csv
However, the node that actually gets removed may not be the desired one.
What you expected to happen: The desired worker node is removed.
How to reproduce it (as minimally and precisely as possible):
1. Create an EKS-A bare metal cluster with 2 worker nodes.
2. Remove one specific worker node from the hardware.csv file and lower the worker node count in the cluster spec by 1 (sketched below).
3. Run eksctl anywhere upgrade cluster -f eksa-new.yaml --hardware-csv hardware-new.csv.
4. Observe that the node removed from the cluster is not necessarily the one removed from hardware.csv.
Anything else we need to know?:
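The scale-in is driven by lowering the worker node group count in the cluster spec. A minimal sketch of that change (the cluster name, node group name, and machine config name below are placeholders, not my real values):
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: eksa-cluster                # placeholder cluster name
spec:
  workerNodeGroupConfigurations:
  - name: md-0                      # placeholder worker node group name
    count: 1                        # lowered from 2 to scale in by one worker
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: eksa-cluster-worker     # placeholder machine config name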
Environment:
- EKS Anywhere Release: v0.18.7
- EKS Distro Release:
eksctl anywhere upgrade -f hardware-new.csv
This is not the right upgrade command. See the documented upgrade command:
https://anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/#upgrade-cluster-command
eksctl anywhere upgrade cluster -f cluster.yaml \
# --hardware-csv <hardware.csv> \ # uncomment to add more hardware
--kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig
You should pass the cluster spec to -f instead of the hardware.csv.
@jiayiwang7 Sorry, my bad. I have already updated the description.
Here was my original command:
eksctl anywhere upgrade cluster -f eksa-mgmt05-cluster-cp1-worker1.yaml --hardware-csv hardware-mgmt05-2-new.csv --no-timeouts -v 9 --skip-validations=pod-disruption
Hi @ygao-armada, thanks for creating the issue. I believe we should have this resolved in our upcoming release.
This issue has been resolved in our latest patch release, v0.19.7.
I am still seeing this issue in our bare metal setup.
EKS-A version:
eksctl anywhere version
Version: v0.20.1
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/69/manifest.yaml
Before starting the scale-in, I have 2 worker nodes:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
instance-528 Ready control-plane 4h24m v1.29.5-eks-1109419 10.103.15.163 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-529 Ready <none> 31m v1.29.5-eks-1109419 10.103.15.165 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-530 Ready <none> 3h34m v1.29.5-eks-1109419 10.103.15.182 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-531 Ready control-plane 4h7m v1.29.5-eks-1109419 10.103.15.184 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-532 Ready control-plane 3h47m v1.29.5-eks-1109419 10.103.15.186 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
Then I edited my hardware CSV file to remove the instance-530 worker node:
cat hardware-targeted-scale-down.csv
hostname,bmc_ip,bmc_username,bmc_password,mac,ip_address,netmask,gateway,nameservers,labels,disk
instance-531,10.204.196.126,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.184,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-532,10.204.196.127,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.186,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-528,10.204.196.125,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.163,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-529,10.204.196.129,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.165,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=worker,/dev/nvme0n1
I also adjusted my cluster config file to scale the worker node count down to 1, then ran the command:
eksctl anywhere upgrade cluster -f cluster-config-upgrade-20240730094824-scale-to-1.yaml --hardware-csv hardware-targeted-scale-down.csv --kubeconfig /home/ubuntu/eksanywhere/eksa-xxxx-cluster2n/eksa-xxxx-cluster2n-eks-a-cluster.kubeconfig --skip-validations=pod-disruption
Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
✅ Worker nodes ready
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate eksaVersion skew is one minor version
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing GitOps cluster resources reconcile
Upgrading core components
Backing up management cluster's resources before upgrading
Upgrading management cluster
Updating Git Repo with new EKS-A cluster spec
Finalized commit and committed to local repository {"hash": "2d209dbf9ebd2a0f45ff88c8fe1a793f4d11348a"}
Forcing reconcile Git repo with latest commit
Resuming GitOps cluster resources kustomization
Writing cluster config file
🎉 Cluster upgraded!
Cleaning up backup resources
However, EKS Anywhere still does not delete the node that I removed from the hardware CSV. Instead, it starts draining the other worker node, instance-529:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
instance-528 Ready control-plane 4h26m v1.29.5-eks-1109419 10.103.15.163 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-529 Ready,SchedulingDisabled <none> 33m v1.29.5-eks-1109419 10.103.15.165 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-530 Ready <none> 3h36m v1.29.5-eks-1109419 10.103.15.182 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-531 Ready control-plane 4h9m v1.29.5-eks-1109419 10.103.15.184 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-532 Ready control-plane 3h49m v1.29.5-eks-1109419 10.103.15.186 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
Per my understanding of the fix, instance-530 should have been deleted, since it was the node removed from the hardware CSV. However, after the scale-in upgrade it is the other worker node, instance-529, that gets deleted:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
instance-528 Ready control-plane 4h32m v1.29.5-eks-1109419 10.103.15.163 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-530 Ready <none> 3h43m v1.29.5-eks-1109419 10.103.15.182 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-531 Ready control-plane 4h15m v1.29.5-eks-1109419 10.103.15.184 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
instance-532 Ready control-plane 3h55m v1.29.5-eks-1109419 10.103.15.186 <none> Ubuntu 22.04.4 LTS 5.15.0-113-generic containerd://1.7.18-0-gae71819c4
Can someone help?
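In case it helps with triage, I assume the selection can be inspected on the management cluster through the CAPI Machine and Tinkerbell Hardware objects, something like the commands below (the eksa-system namespace is my assumption from a default EKS-A bare metal setup):
kubectl get machines.cluster.x-k8s.io -n eksa-system -o wide   # which CAPI Machine maps to which node, and which one is being deleted
kubectl get hardware.tinkerbell.org -n eksa-system             # which Hardware entries EKS-A currently knows about
As an untested workaround idea, I assume the upstream Cluster API delete-machine annotation could be set on the Machine backing instance-530 before the scale-in so that the MachineSet prioritizes it for deletion, e.g.:
kubectl annotate machines.cluster.x-k8s.io <machine-for-instance-530> -n eksa-system cluster.x-k8s.io/delete-machine="yes"   # <machine-for-instance-530> is a placeholder
But I would prefer the hardware.csv-driven behavior to work as documented.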