kubeadm-bootstrap fails on vSphere AirGapped Bottlerocket Deployment
What happened:
I have been trying to deploy EKS-A on an airgapped vSphere environment and having little success. I am following the guides, but have not been able to get past bringing up the etcd nodes. It looks like the kubeadm-bootstrap host container is failing with the following error:
```
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: -----END RSA PRIVATE KEY-----
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: } {Path:/run/cluster-api/placeholder Owner:root:root Permissions:0640 Content:This placeholder file is used to create the /run/cluster-api sub directory in a way that is compatible with both Linux and Windows (mkdir -p /run/cluster-api does not work with Windows)}] RunCmd:EtcdadmInit public.ecr.aws/eks-distro/etcd-io/etcd 3.5.15-eks-1-31-7 TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256}
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Using etcdadm support by CAPI
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Writing userdata write files
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init phases
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init install phase
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Phase command output:
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: --------
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="[install] Removing existing data dir \"/var/lib/etcd/data\""
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: --------
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init certificates phase
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: Stopping Host container: kubeadm-bootstrap...
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="received signal: terminated"
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="container task exited" code=143
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=fatal msg="Container kubeadm-bootstrap exited with non-zero status"
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: host-containers@kubeadm-bootstrap.service: Main process exited, code=exited, status=1/FAILURE
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: host-containers@kubeadm-bootstrap.service: Failed with result 'exit-code'.
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: Stopped Host container: kubeadm-bootstrap.
```
This appears to happen at the 'etcdadm init certificates' phase, with the container exiting. None of the other etcd nodes can then join (they cannot find the cert server), and the cluster creation fails.
I have tried this on a number of different versions of eks-anywhere:
- v0.20.0 (bundle 68)
- v0.20.9 (bundle 78)
- v0.21.1 (bundle 83)
Sadly I don't know enough about how these certs are generated or the internals of the bootstrap container, but each attempted creation fails at the same stage.
What you expected to happen:
The initialisation completes and the etcd nodes are healthy.
How to reproduce it (as minimally and precisely as possible):
To reproduce with the latest version of eks-a:
- Download the latest eks-anywhere release
- Download the artifacts as described in the airgapped installation guide
- Download and import the images as described in the same guide
- Download the Bottlerocket OVA and upload it to vSphere as a template
This is the mgmt cluster configuration:
```yaml
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: xxxxx-mgmt
spec:
  registryMirrorConfiguration:
    endpoint: harbor.xxx.xxxxx.com
    port: 443
    authenticate: false
    caCertContent: |
      -----BEGIN CERTIFICATE-----
      xxxxxxx
      JsRPFd8GD+ZAEnOQ
      -----END CERTIFICATE-----
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: "10.xxx.xxx.xxx"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: xxxxx-mgmt-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: xxxxx-mgmt
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: xxxxx-mgmt-etcd
  kubernetesVersion: "1.31"
  managementCluster:
    name: xxxxx-mgmt
  workerNodeGroupConfigurations:
    - count: 2
      machineGroupRef:
        kind: VSphereMachineConfig
        name: xxxxx-mgmt
      name: md-0
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: xxxxx-mgmt
spec:
  datacenter: "xxx"
  insecure: true
  network: "Kubernetes"
  server: "xxx-xxx-vca-01.xxx.xxxxx.com"
  thumbprint: "xxxxxx"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt-cp
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 25
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa xxx ec2-user
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 150
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 4
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa xxx ec2-user
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt-etcd
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 25
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa xxx ec2-user
---
```
Anything else we need to know?:
Environment:
- EKS Anywhere Release: v0.21.1
- EKS Distro Release: v1.31
- Operating Systems: Ubuntu 22.04 and Fedora 41
- Bottlerocket Version: bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2
I eventually managed to track down the issue. Grepping through the logs turned up the following errors:
```
Dec 10 16:07:36 localhost systemd-tmpfiles[1652]: Reading config file "/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/tmpfiles.d/release-ca-certificates.conf"…
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: Running etcdadm init certificates phase
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=info msg="[certificates] creating PKI assets"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=fatal msg="[certificates] failed creating PKI assets: failure loading ca certificate: the certificate is not valid yet"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: Error running bootstrapper cmd: error running etcdadm phase 'init certificates', out:
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=info msg="[certificates] creating PKI assets"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=fatal msg="[certificates] failed creating PKI assets: failure loading ca certificate: the certificate is not valid yet"
Dec 10 16:08:57 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[2474]: Running etcdadm init certificates phase
```
As this was a disconnected cluster, that led me to check whether the time was out of sync: the admin machine I was using was 8 minutes ahead of the ESXi hosts, so the freshly generated CA certificate was not yet valid from the etcd node's point of view. It was rejected, and the bootstrap container crashed.
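For anyone hitting the same message: the failure is the standard x509 validity-window check. The CA's NotBefore is stamped with the clock of the machine that generated it (the key material arrives via userdata, as in the first log above), so if that clock is ahead of the node loading the cert, the node sees a certificate that starts in the future and refuses it. Below is a minimal Go sketch of an equivalent check you can run against the generated CA; the /etc/etcd/pki/ca.crt path is an assumption based on etcdadm defaults, I have not confirmed where Bottlerocket places it:

```go
// Sketch: load a CA certificate and compare its validity window to the
// local clock, mirroring the "not valid yet" check that etcdadm trips on.
// The path below is an assumption (etcdadm default), not verified on
// Bottlerocket.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	pemBytes, err := os.ReadFile("/etc/etcd/pki/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		log.Fatal("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	now := time.Now()
	fmt.Printf("now:       %s\nnotBefore: %s\nnotAfter:  %s\n",
		now.UTC(), cert.NotBefore, cert.NotAfter)
	if now.Before(cert.NotBefore) {
		fmt.Println("certificate is not valid yet -- check for clock skew")
	}
}
```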
I don't know whether it would be possible to add this to the airgapped requirements / troubleshooting documentation so that others can identify it more easily in the future.
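In the meantime, a crude preflight that could catch this is to compare the admin machine's clock against something already reachable inside the airgapped environment before running cluster creation. The sketch below (my own idea, not an existing EKS-A check) reads the HTTP Date header from the registry mirror; the URL and the one-minute threshold are illustrative:

```go
// Sketch: rough clock-skew preflight against the registry mirror.
// Compares the local clock to the HTTP Date header (one-second
// resolution, plenty to catch the ~8 minute skew seen here).
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Skip TLS verification: the mirror uses a private CA and we only
	// read the Date response header.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Head("https://harbor.xxx.xxxxx.com")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	remote, err := http.ParseTime(resp.Header.Get("Date"))
	if err != nil {
		log.Fatal(err)
	}
	skew := time.Since(remote)
	fmt.Printf("local clock minus mirror clock: %s\n", skew.Round(time.Second))
	if skew > time.Minute || skew < -time.Minute {
		fmt.Println("warning: clock skew exceeds one minute; sync NTP before creating the cluster")
	}
}
```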