
kubeadm-bootstrap fails on vSphere AirGapped Bottlerocket Deployment

Open · geoffo-dev opened this issue 11 months ago · 1 comment

What happened:

I have been trying to deploy EKS-A on an airgapped vSphere environment with little success. I am following the guides, but have not been able to get past the etcd node build. The kubeadm-bootstrap appears to fail with the following error:

Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: -----END RSA PRIVATE KEY-----
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: } {Path:/run/cluster-api/placeholder Owner:root:root Permissions:0640 Content:This placeholder file is used to create the /run/cluster-api sub directory in a way that is compatible with both Linux and Windows (mkdir -p /run/cluster-api does not work with Windows)}] RunCmd:EtcdadmInit public.ecr.aws/eks-distro/etcd-io/etcd 3.5.15-eks-1-31-7 TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256}
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Using etcdadm support by CAPI
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Writing userdata write files
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init phases
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init install phase
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Phase command output:
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: --------
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="[install] Removing existing data dir \"/var/lib/etcd/data\""
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: --------
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init certificates phase
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: Stopping Host container: kubeadm-bootstrap...
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="received signal: terminated"
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="container task exited" code=143
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=fatal msg="Container kubeadm-bootstrap exited with non-zero status"
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: host-containers@kubeadm-bootstrap.service: Main process exited, code=exited, status=1/FAILURE
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: host-containers@kubeadm-bootstrap.service: Failed with result 'exit-code'.
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: Stopped Host container: kubeadm-bootstrap.

This appears to happen at the `etcdadm init certificates` phase, with the container failing. None of the other etcd nodes can then join (as they cannot find the cert server) and the cluster creation fails.
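For anyone trying to get at these logs themselves, a minimal sketch, assuming SSH access to the node via the Bottlerocket admin container (the unit name is taken straight from the journal lines above):

# from the Bottlerocket admin container, drop to a root shell on the host
sudo sheltie

# dump the bootstrap host-container logs (unit name as seen in the journal above)
journalctl -u host-containers@kubeadm-bootstrap --no-pager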

I have tried this on a number of different versions of eks-anywhere:

  • v0.20.0 (bundle 68)
  • v0.20.9 (bundle 78)
  • v0.21.1 (bundle 83)

Sadly I don't know enough about how these certs are generated or about the internals of the bootstrap container, but each attempted creation appears to fail at the same stage.

What you expected to happen:

The initialisation completes and the etcd nodes are healthy.

How to reproduce it (as minimally and precisely as possible):

Using the latest version of eks-a:

  • Download the latest version of eks-anywhere
  • Download the artifacts as described in the guides
  • Download the images as described in the guides
  • Download the Bottlerocket OVA and upload it to vSphere as a template (a govc sketch follows this list)
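For the template step, a minimal govc sketch (the OVA file name, datastore, and folder are placeholders matching my config below, not exact commands from the docs):

# assumes GOVC_URL / GOVC_USERNAME / GOVC_PASSWORD point at the vCenter
govc import.ova -name bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2 \
  -ds Datastore_01_C1 -folder eksa/xxxxx-mgmt \
  ./bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2.ova
govc vm.markastemplate bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2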

This is the mgmt cluster configuration:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: xxxxx-mgmt
spec:
  registryMirrorConfiguration:
    endpoint: harbor.xxx.xxxxx.com
    port: 443
    authenticate: false
    caCertContent: |
      -----BEGIN CERTIFICATE-----
      xxxxxxx
      JsRPFd8GD+ZAEnOQ
      -----END CERTIFICATE-----
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: "10.xxx.xxx.xxx"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: xxxxx-mgmt-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: xxxxx-mgmt
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: xxxxx-mgmt-etcd
  kubernetesVersion: "1.31"
  managementCluster:
    name: xxxxx-mgmt
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: xxxxx-mgmt
    name: md-0

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: xxxxx-mgmt
spec:
  datacenter: "xxx"
  insecure: true
  network: "Kubernetes"
  server: "xxx-xxx-vca-01.xxx.xxxxx.com"
  thumbprint: "xxxxxx"

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt-cp
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 25
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa xxx ec2-user  

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 150
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 4
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa xxx ec2-user  
  
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt-etcd
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 25
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa xxx ec2-user  

---

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.21.1
  • EKS Distro Release: v1.31
  • Operating Systems: Ubuntu 22.04 and Fedora 41
  • Bottlerocket Version: bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2

geoffo-dev commented Dec 04 '24 18:12

So I eventually managed to track down the issue. Grepping through the logs, I found the following errors:

Dec 10 16:07:36 localhost systemd-tmpfiles[1652]: Reading config file "/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/tmpfiles.d/release-ca-certificates.conf"…
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: Running etcdadm init certificates phase
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=info msg="[certificates] creating PKI assets"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=fatal msg="[certificates] failed creating PKI assets: failure loading ca certificate: the certificate is not valid yet"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: Error running bootstrapper cmd: error running etcdadm phase 'init certificates', out:
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]:  time="2024-12-10T16:08:00Z" level=info msg="[certificates] creating PKI assets"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=fatal msg="[certificates] failed creating PKI assets: failure loading ca certificate: the certificate is not valid yet"
Dec 10 16:08:57 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[2474]: Running etcdadm init certificates phase

As this was a disconnected cluster, this led me to the fact that the time was out of sync: the admin machine I was using was 8 minutes ahead of the ESXi hosts, so the freshly generated CA certificate was "not valid yet" from the node's point of view, which caused the crash.
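To confirm the skew from a failing node, something like this works (a sketch; /etc/etcd/pki is the etcdadm default certificate directory, so treat the path as an assumption):

# on the etcd node, via the admin container / sheltie
date -u
# compare the host clock against the CA validity window
openssl x509 -in /etc/etcd/pki/ca.crt -noout -startdate -enddate

If the notBefore timestamp is ahead of the host clock, you get exactly the "certificate is not valid yet" failure above.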

I don't know if it would be possible to add this to the airgapped requirements / troubleshooting documentation so that others can identify it more easily in the future.
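As a possible pre-flight check for such documentation, comparing the admin machine clock against the ESXi hosts is quick with govc (a hedged sketch; the host inventory path is a placeholder for my environment):

# admin machine time (UTC)
date -u
# time and NTP configuration reported by an ESXi host
govc host.date.info -host /xxx/host/xxx-xxx-vcl-01/xxx-esxi-01

Anything more than a minute or two of drift between the two is worth fixing before running the cluster create.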

geoffo-dev commented Dec 10 '24 18:12