EKSA bare metal cluster scale-in doesn't honor new hardware.csv file

ygao-armada opened this issue 1 year ago

What happened: In an EKSA bare metal cluster, I tried to scale the cluster down by 1 worker node by removing a specific worker node from the hardware.csv file and running the following command: eksctl anywhere upgrade cluster -f eksa-new.yaml --hardware-csv hardware-new.csv. However, the node that ends up being removed may not be the desired one.

What you expected to happen: The desired worker node is removed.

How to reproduce it (as minimally and precisely as possible): Create an EKSA bare metal cluster with 2 worker nodes, then scale it down to 1 as described above.
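
For illustration, the scale-in itself is driven by lowering the worker node group count in the Cluster spec; a minimal sketch of the relevant section (the node group and machine config names are placeholders, not from my setup):

workerNodeGroupConfigurations:
  - name: md-0                      # placeholder node group name
    count: 1                        # lowered from 2 to trigger the scale-in
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-cluster-worker       # placeholder machine config name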

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.18.7
  • EKS Distro Release:

ygao-armada avatar May 21 '24 04:05 ygao-armada

eksctl anywhere upgrade -f hardware-new.csv

This is not the right upgrade command

https://anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/#upgrade-cluster-command

eksctl anywhere upgrade cluster -f cluster.yaml \
# --hardware-csv <hardware.csv> \ # uncomment to add more hardware
--kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig

You should pass the cluster spec to -f instead of the hardware.csv file.
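
For a scale-in, both flags are used together; with the file names from this issue's description it would look roughly like the following (the kubeconfig path is the placeholder from the docs example):

eksctl anywhere upgrade cluster -f eksa-new.yaml \
  --hardware-csv hardware-new.csv \
  --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig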

jiayiwang7 avatar May 22 '24 15:05 jiayiwang7

@jiayiwang7 sorry, my bad, I have already updated the description.

Here was my original command: eksctl anywhere upgrade cluster -f eksa-mgmt05-cluster-cp1-worker1.yaml --hardware-csv hardware-mgmt05-2-new.csv --no-timeouts -v 9 --skip-validations=pod-disruption

ygao-armada avatar May 22 '24 15:05 ygao-armada

Hi @ygao-armada, thanks for creating the issue. I believe we should have this resolved in our upcoming release.

pokearu avatar Jun 12 '24 04:06 pokearu

This issue has been resolved in our latest patch release, v0.19.7.

sp1999 avatar Jun 14 '24 21:06 sp1999

I am still seeing this issue in our bare metal setup.

EKS-A version:

eksctl anywhere version
Version: v0.20.1
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/69/manifest.yaml

Before starting the scale-in, I have 2 worker nodes:

kubectl get nodes -o wide
NAME           STATUS   ROLES           AGE     VERSION               INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
instance-528   Ready    control-plane   4h24m   v1.29.5-eks-1109419   10.103.15.163   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-529   Ready    <none>          31m     v1.29.5-eks-1109419   10.103.15.165   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-530   Ready    <none>          3h34m   v1.29.5-eks-1109419   10.103.15.182   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-531   Ready    control-plane   4h7m    v1.29.5-eks-1109419   10.103.15.184   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-532   Ready    control-plane   3h47m   v1.29.5-eks-1109419   10.103.15.186   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4

Then I edited my hardware CSV file to remove the instance-530 worker node:

cat hardware-targeted-scale-down.csv
hostname,bmc_ip,bmc_username,bmc_password,mac,ip_address,netmask,gateway,nameservers,labels,disk
instance-531,10.204.196.126,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.184,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-532,10.204.196.127,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.186,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-528,10.204.196.125,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.163,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-529,10.204.196.129,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.165,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=worker,/dev/nvme0n1

I also adjusted my cluster config file to scale the worker node count to 1 and then ran the following command:

eksctl anywhere upgrade cluster -f cluster-config-upgrade-20240730094824-scale-to-1.yaml --hardware-csv hardware-targeted-scale-down.csv --kubeconfig /home/ubuntu/eksanywhere/eksa-xxxx-cluster2n/eksa-xxxx-cluster2n-eks-a-cluster.kubeconfig --skip-validations=pod-disruption
Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
✅ Worker nodes ready
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate eksaVersion skew is one minor version
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing GitOps cluster resources reconcile
Upgrading core components
Backing up management cluster's resources before upgrading
Upgrading management cluster
Updating Git Repo with new EKS-A cluster spec
Finalized commit and committed to local repository      {"hash": "2d209dbf9ebd2a0f45ff88c8fe1a793f4d11348a"}
Forcing reconcile Git repo with latest commit
Resuming GitOps cluster resources kustomization
Writing cluster config file
🎉 Cluster upgraded!
Cleaning up backup resources

However, EKS Anywhere still does not delete the node that I removed from the hardware CSV. Instead, it starts deleting the other node:

kubectl get nodes -o wide
NAME           STATUS                     ROLES           AGE     VERSION               INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
instance-528   Ready                      control-plane   4h26m   v1.29.5-eks-1109419   10.103.15.163   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-529   Ready,SchedulingDisabled   <none>          33m     v1.29.5-eks-1109419   10.103.15.165   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-530   Ready                      <none>          3h36m   v1.29.5-eks-1109419   10.103.15.182   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-531   Ready                      control-plane   4h9m    v1.29.5-eks-1109419   10.103.15.184   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-532   Ready                      control-plane   3h49m   v1.29.5-eks-1109419   10.103.15.186   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4

Per my understanding of the fix, instance-530 should have been deleted, since it was removed from the hardware CSV. However, after the scale-in upgrade, the other node (instance-529) is deleted instead:

kubectl get nodes -o wide
NAME           STATUS   ROLES           AGE     VERSION               INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
instance-528   Ready    control-plane   4h32m   v1.29.5-eks-1109419   10.103.15.163   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-530   Ready    <none>          3h43m   v1.29.5-eks-1109419   10.103.15.182   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-531   Ready    control-plane   4h15m   v1.29.5-eks-1109419   10.103.15.184   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-532   Ready    control-plane   3h55m   v1.29.5-eks-1109419   10.103.15.186   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
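
In case it helps with debugging, the hardware and machine objects the controller selected can be inspected with something like the following (assuming the standard eksa-system namespace that EKS Anywhere management clusters use for these objects):

kubectl get hardware -n eksa-system --show-labels
kubectl get machines -n eksa-system -o wide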

Can someone help?

thecloudgarage avatar Jul 30 '24 14:07 thecloudgarage