eks-anywhere
Unable to deploy EKS-A to vSphere cluster
What happened: Unable to deploy EKS-A on ESXi 8 U1
What you expected to happen: The initial cluster to be deployed
How to reproduce it (as minimally and precisely as possible): Just follow the process from the documentation to deploy the initial cluster.
When it tries to deploy the first etcd VM from the templates, the VM is created, but then briefly after creation it is removed and I see the following error:
A specified parameter was not correct: spec.config.deviceChange[0].operation
I've tried with multiple BR versions from 1.26 to 1.29 and all of them fail. I also tried on two completely separate ESXi/vSphere clusters with the same results.
Environment: Latest EKS-A CLI (from brew) on macOS Sonoma (fully updated) deploying to ESXi/vCenter/vSAN 8U1.
I would check the `spec: datastore:` setting first. Can you post your cluster manifest?
@Darth-Weider thanks for the reply.
For those who are having similar issues, here is a TL;DR:
- Set `VSphereMachineConfig.spec.cloneMode` to `linkedClone` AND remove the `diskGiB` field, which is added by default when you run the `generate` command on the CLI.
- `pods.cidrBlocks` and `services.cidrBlocks` should NOT collide with the DHCP range either. The DHCP range is ONLY used for the VM IPs.
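To make the first point concrete, here is a minimal sketch of the relevant `VSphereMachineConfig` fields (the names `awsemu-etcd` and `vsandatastore` are just the values from my setup, and the commented-out `diskGiB: 25` is only an example value):

```yaml
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-etcd
spec:
  cloneMode: linkedClone   # must be set explicitly
  # diskGiB: 25            # added by `generate`; remove it entirely when using linkedClone
  datastore: vsandatastore
  osFamily: bottlerocket
```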
I've just finally figured out what is going on here. There were a few things that weren't really clear when reading the docs:

- When `VSphereMachineConfig.spec.cloneMode` is not set and `diskGiB` is set to anything (as it is after the `generate` command runs), it throws that error. When we then set `cloneMode` to `linkedClone`, validation fails saying that we shouldn't set `diskGiB`. We then removed `diskGiB` and it worked: the images were deployed just fine.
- The next failure was a complaint that the IP of the control plane was not unique, when I'm pretty sure it was, since I had created a single VLAN/subnet like 10.254.0.0/16 specifically for EKS-A. Besides .1, which is the gateway, I excluded .1 through .100 from the DHCP range and made .10 the control plane VIP, and it kept saying .10 was in use. I then ran with `--skip-ip-verification` as suggested by another issue here, and it passed through, but the `Etcd` nodes never got ready for whatever reason and the process kept looping, waiting for them to be ready. It turns out the documentation doesn't make clear that `pods`/`services.cidrBlocks` must be a network that (1) doesn't collide with the host and other subnets on your physical network AND (2) is not part of the DHCP range (because of (1)). So as soon as I created a VLAN with 10.170.0.0/24, made the control plane VIP .10, set the DHCP range to .100-.154, and kept `pods.cidrBlocks` at 10.254.0.0/16 (which is a non-routable address space on the physical network), it just worked. The reason for the confusion is that other k8s distros I've used have different CNIs, where pods and/or services get IPs on the underlying network, but that is not the case with EKS-A.
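A quick way to sanity-check this before running the CLI is to verify that neither cluster-internal CIDR block overlaps the VM subnet. A small sketch using Python's `ipaddress` module, with the addresses from the working setup described above:

```python
import ipaddress

# VM subnet: the VLAN whose DHCP range hands out the node VM IPs.
vm_subnet = ipaddress.ip_network("10.170.0.0/24")

# Cluster-internal ranges from the EKS-A manifest.
cluster_cidrs = {
    "pods.cidrBlocks": ipaddress.ip_network("10.254.0.0/16"),
    "services.cidrBlocks": ipaddress.ip_network("10.96.0.0/12"),
}

# Each cluster-internal range must NOT overlap the VM subnet.
for name, cidr in cluster_cidrs.items():
    status = "COLLIDES with" if cidr.overlaps(vm_subnet) else "does not touch"
    print(f"{name} {cidr} {status} {vm_subnet}")
```

If either line reports a collision, pick a different `cidrBlocks` range before deploying.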
The fun fact is that this is not consistent. I've created the same config multiple times on the same environment, and sometimes the process fails at the end, during "Creating EKS-A namespace", with "The connection to the server localhost:8080 was refused", which makes no sense as I don't have anything listening on localhost at 8080.
Here is the config:
```yaml
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: awsemu
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 172.18.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 3
    endpoint:
      host: "172.16.1.1"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: awsemu-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: datacenter
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: awsemu-etcd
  kubernetesVersion: "1.29"
  managementCluster:
    name: awsemu
  workerNodeGroupConfigurations:
    - count: 1
      machineGroupRef:
        kind: VSphereMachineConfig
        name: awsemu
      name: md-0
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: datacenter
spec:
  datacenter: datacenter
  insecure: false
  network: workload
  server: 192.168.8.12
  thumbprint: "27:44:A2:74:89:B4:D3:4E:97:30:D7:AF:3B:88:06:F4:08:0C:4F:D7"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-cp
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/Control Plane
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/Worker Nodes
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-etcd
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/ETCD
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==
```
This also leaves behind all the VMs it created, and the cluster is in a state where it isn't ready, nor can I delete it with eksctl, so all we can do is manually stop and delete each VM...
galvesribeiro Can you try fullClone instead of linkedClone? Also, the CP node IP address is set to "172.16.1.1"? Is it your VLAN gateway IP? And does your EKS-A VLAN have access to your vCenter API endpoint?
@Darth-Weider
Can you try fullclone instead linkedclone
Full clone is what was causing vSphere to fail with that message, as you can see in the picture (A specified parameter was not correct: spec.config.deviceChange[0].operation). I was only able to get past it and deploy the VMs with `linkedClone`. Otherwise, that error appears on vSphere and the EKS-A CLI keeps looping, "waiting" for the Etcd to get ready, which clearly would never happen 😄.
Also the CP node ip address is set to "172.16.1.1" ? Is it your vlan gateway IP?
No. The network is:
- Address space: 172.16.0.0/16
- Gateway/DNS: 172.16.0.1
- DHCP range: 172.16.2.1-172.16.2.254
- CP: 172.16.1.1
And does your EKS-A vlan have access to your vCenter API endpoint ?
Yep. vCenter is at 192.168.8.12, which is routable through the 172.16.0.1 gateway.