
Unable to deploy EKS-A to vSphere cluster

Open: galvesribeiro opened this issue 2 months ago • 5 comments

What happened: Unable to deploy EKS-A on ESXi 8 U1

What you expected to happen: The initial cluster to be deployed

How to reproduce it (as minimally and precisely as possible): Just follow the process from the documentation to deploy the initial cluster.
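
For completeness, the reproduction is just the standard flow from the docs (the cluster name is mine):

eksctl anywhere generate clusterconfig awsemu --provider vsphere > awsemu.yaml
# ...edit awsemu.yaml with the vSphere datacenter/datastore/network details...
eksctl anywhere create cluster -f awsemu.yaml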

When it tries to deploy the first etcd VM from the templates, the VM is created, but shortly after creation it is removed and I see the following error:

[screenshot of the vSphere task error]

A specified parameter was not correct: spec.config.deviceChange[0].operation

I've tried multiple Bottlerocket (BR) versions, from 1.26 through 1.29, and all of them fail. I also tried on two completely separate ESXi/vSphere clusters with the same results.

Environment: Latest EKS-A CLI (from brew) on macOS Sonoma (fully updated) deploying to ESXi/vCenter/vSAN 8U1.

galvesribeiro avatar Apr 10 '24 05:04 galvesribeiro

I would check the spec.datastore setting first. Can you post your cluster manifest?

Darth-Weider avatar Apr 12 '24 17:04 Darth-Weider

@Darth-Weider thanks for the reply.

For those who are having similar issues, here is a TL;DR:

  1. Set VSphereMachineConfig.spec.cloneMode to linkedClone AND remove the diskGiB field, which is added by default when you run the generate command on the CLI (see the snippet below this list).
  2. pods.cidrBlocks and services.cidrBlocks must NOT collide with the DHCP range either; the DHCP range is used ONLY for the VM IPs.
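
In config form, here is a trimmed sketch of what ended up working (values other than cloneMode/diskGiB are just from my setup):

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-etcd
spec:
  cloneMode: linkedClone   # leaving this unset while diskGiB was set triggered the deviceChange error
  # diskGiB: 25            # remove; generate adds a value here, and linkedClone disallows it
  datastore: vsandatastore
  osFamily: bottlerocket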

I've finally figured out what is going on here. A few things weren't really clear when reading the docs:

  1. When VSphereMachineConfig.spec.cloneMode is not set and diskGiB is set to anything (as it is when the generate command runs), vSphere throws that error. When we then set cloneMode to linkedClone, it failed validation, saying that we shouldn't set diskGiB. We removed diskGiB and it worked; the images were deployed just fine.
  2. The next failure was a complaint that the control plane IP was not unique, when I'm pretty sure it was: I had created a single VLAN/subnet (10.254.0.0/16) specifically for EKS-A, excluded .1 through .100 from the DHCP range (with .1 as the gateway), and made .10 the control plane VIP, yet it kept saying .10 was in use. I then ran with --skip-ip-verification, as suggested in another issue here, and it got past that check, but the etcd nodes never became ready and the process kept looping, waiting for them. It turns out the documentation doesn't make clear that pods.cidrBlocks and services.cidrBlocks must be networks that (1) don't collide with the host and other subnets on your physical network AND (2) are not part of the DHCP range (which follows from (1)). As soon as I created a VLAN with 10.170.0.0/24, made the control plane VIP .10, set the DHCP range to .100-.154, and kept pods.cidrBlocks at 10.254.0.0/16 (a non-routable address space on my physical network), it just worked (see the sketch after this list). The source of the confusion is that other Kubernetes distros I've used ship CNIs where pods and/or services get IPs on the underlying network, but that is not the case with EKS-A.
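
For reference, here is a minimal sketch of the layout that worked, with the node network and the cluster-internal CIDRs kept disjoint (the concrete values are the ones from my environment and are only illustrative):

# Node VLAN (real, routable; addresses handed out by DHCP):
#   subnet:      10.170.0.0/24
#   gateway:     10.170.0.1
#   DHCP range:  10.170.0.100 - 10.170.0.154   (node VM IPs only)
#   CP VIP:      10.170.0.10                   (static, outside the DHCP range)
# Cluster-internal (virtual; must not overlap anything routable):
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 10.254.0.0/16    # never routed on the physical network
    services:
      cidrBlocks:
      - 10.96.0.0/12     # also non-routable on the physical network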

galvesribeiro avatar Apr 12 '24 17:04 galvesribeiro

The fun fact is that this is not consistent. I've created the same config multiple times in the same environment, and sometimes the process fails at the end, during "Creating EKS-A namespace", with "The connection to the server localhost:8080 was refused", which made no sense to me since I don't have anything listening on localhost:8080.
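
For what it's worth, "localhost:8080" there is kubectl's generic fallback when it can't find a kubeconfig, not something that has to be listening on your machine. A quick sanity check along these lines (assuming the default layout, where the CLI writes the generated kubeconfig into a folder named after the cluster):

# kubectl falls back to localhost:8080 when it has no kubeconfig to use
export KUBECONFIG="${PWD}/awsemu/awsemu-eks-a-cluster.kubeconfig"
kubectl get nodes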

Here is the config:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: awsemu
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 172.18.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 3
    endpoint:
      host: "172.16.1.1"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: awsemu-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: datacenter
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: awsemu-etcd
  kubernetesVersion: "1.29"
  managementCluster:
    name: awsemu
  workerNodeGroupConfigurations:
  - count: 1
    machineGroupRef:
      kind: VSphereMachineConfig
      name: awsemu
    name: md-0

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: datacenter
spec:
  datacenter: datacenter
  insecure: false
  network: workload
  server: 192.168.8.12
  thumbprint: "27:44:A2:74:89:B4:D3:4E:97:30:D7:AF:3B:88:06:F4:08:0C:4F:D7"

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-cp
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/Control Plane
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/Worker Nodes
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-etcd
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/ETCD
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==

---

This also leaves behind all the VMs that were created, with the cluster in a state where it isn't ready and can't be deleted with eksctl, so all we can do is manually stop and delete each VM...
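
In case it helps anyone else with the cleanup, something along these lines with govc should find and remove the leftovers (destructive; the folder path and name pattern below are from my config, and the VM name is hypothetical):

# assumes GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD point at the vCenter
# list the machines EKS-A created (named after the cluster)
govc find /datacenter/vm/Kubernetes -type m -name 'awsemu*'

# destroy each leftover VM (govc powers it off first if needed)
govc vm.destroy '/datacenter/vm/Kubernetes/Management/ETCD/awsemu-etcd-abc12'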

galvesribeiro avatar Apr 13 '24 00:04 galvesribeiro

@galvesribeiro Can you try fullClone instead of linkedClone? Also, the CP node IP address is set to "172.16.1.1"? Is it your VLAN gateway IP? And does your EKS-A VLAN have access to your vCenter API endpoint?

Darth-Weider avatar Apr 13 '24 01:04 Darth-Weider

@Darth-Weider

Can you try fullClone instead of linkedClone?

fullClone is what was causing vSphere to fail with the message you see in the screenshot (A specified parameter was not correct: spec.config.deviceChange[0].operation). I was only able to get past it and deploy the VMs with linkedClone. Otherwise, that error appears in vSphere and the EKS-A CLI keeps looping, "waiting" for the etcd nodes to get ready, which clearly would never happen 😄.
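
For context, the semantics of the two modes as I understand them (a sketch, not authoritative):

# fullClone: independent copy of the template disk; diskGiB may be set/grown
spec:
  cloneMode: fullClone
  diskGiB: 25

# linkedClone: delta disk backed by a template snapshot; diskGiB must be omitted
spec:
  cloneMode: linkedClone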

Also, the CP node IP address is set to "172.16.1.1"? Is it your VLAN gateway IP?

No. The network is:

  • Address space: 172.16.0.0/16
  • Gateway/DNS: 172.16.0.1
  • DHCP range: 172.16.2.1 - 172.16.2.254
  • CP VIP: 172.16.1.1

And does your EKS-A vlan have access to your vCenter API endpoint ?

Yep. vCenter is 192.168.8.12, which is routable through the 172.16.0.1 gateway.

galvesribeiro avatar Apr 13 '24 02:04 galvesribeiro