
Kubernetes API server does not start if DHCP assigns new IP addresses that differ from the originals

Open echel0n opened this issue 2 years ago • 3 comments

What happened: The DHCP leases for the cluster expired and new IP addresses were assigned. The Kubernetes API server would not start afterwards; the error found in the journal was kubelet.go:2451] "Error getting node" err="node \"192.168.3.192\" not found"

What you expected to happen: The cluster to detect the new IP addresses assigned by DHCP and adjust accordingly so the API server and other services can come up

How to reproduce it (as minimally and precisely as possible): Create a new EKSA cluster, shut it down, delete the DHCP leases so new IP addresses get assigned, then restart the cluster

Anything else we need to know?: NODE_IP is hard-coded with the original DHCP IP address in the env file located under /etc/kubernetes/kubelet. By statically assigning the original address via DHCP, I was able to get the cluster working again.
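
For illustration, the mismatch can be confirmed on an affected node roughly like this (a sketch only; the exact unit name and file layout may differ between OS images):

# look for the kubelet error in the journal
journalctl -u kubelet --no-pager | grep "Error getting node"
# compare the pinned NODE_IP with the address the node actually has now
grep -r NODE_IP /etc/kubernetes/kubelet
ip -4 addr show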

Environment:

  • EKS Anywhere Release: v0.10.1
  • EKS Distro Release: 1.22

echel0n, Aug 05 '22 17:08

hey @echel0n, I'm sorry to hear you ran into this issue. Did this impact worker nodes as well as control plane nodes, or just control plane nodes? Is the address 192.168.3.192 the value you provided as the control plane endpoint host in your cluster configuration? If you could include a sanitized copy of your cluster config that'd be helpful as well.

I ask as we specifically recommend that the IP address provided to your control plane via the Control Plane Configuration be excluded from the DHCP range (see the snippet after the links below). For more information, check out:

  • https://anywhere.eks.amazonaws.com/docs/reference/vsphere/vsphere-prereq/#:~:text=Below%20are%20some,existent%20mac%20address.
  • https://anywhere.eks.amazonaws.com/docs/reference/clusterspec/vsphere/#controlplaneconfigurationendpointhost-required
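
For reference, the field in question in the cluster spec looks roughly like this (the address is a placeholder; it should be a free, static IP that is excluded from the DHCP pool):

controlPlaneConfiguration:
  endpoint:
    host: 192.168.0.10   # placeholder: an unused address outside the DHCP range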

danbudris, Aug 16 '22 15:08

hi @danbudris, it happened again. It seems that if any of the control plane or etcd nodes end up with a new DHCP-assigned IP address that differs from the original address assigned at cluster creation, you lose access to the API server and can no longer control the cluster; the worker nodes seem fine. To resolve this, I have to go back and add DHCP static mappings of the original IPs to the MAC addresses of the control plane and etcd nodes, then restart the VMs; after that I have access again (a sketch of this kind of static mapping follows the config). 192.168.3.192 is not the control plane endpoint. Below is a sanitized version of my cluster config, thanks!

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: prod
  namespace: default
spec:
  bundlesRef:
    apiVersion: anywhere.eks.amazonaws.com/v1alpha1
    name: bundles-12
    namespace: eksa-system
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 10.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: 192.168.3.10
    machineGroupRef:
      kind: VSphereMachineConfig
      name: prod-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: prod
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: prod-etcd
  kubernetesVersion: "1.22"
  managementCluster:
    name: prod
  workerNodeGroupConfigurations:
  - count: 4
    machineGroupRef:
      kind: VSphereMachineConfig
      name: prod
    name: md-0

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: prod
  namespace: default
spec:
  datacenter: Dark Systems Datacenter
  insecure: true
  network: /Dark Systems Datacenter/network/DSwitch-10GB-EKS
  server: vcenter.vsphere.darksystems.ca
  thumbprint: ""

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  annotations:
    anywhere.eks.amazonaws.com/control-plane: "true"
  name: prod-cp
  namespace: default
spec:
  datastore: /Dark Systems Datacenter/datastore/vsanDatastore
  diskGiB: 25
  folder: /Dark Systems Datacenter/vm/EKS Anywhere
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /Dark Systems Datacenter/host/Cluster/Resources
  template: /Dark Systems Datacenter/vm/Templates/bottlerocket-vmware-k8s-1.22-x86_64-1.8.0-a6233c22
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ""

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  annotations:
    anywhere.eks.amazonaws.com/etcd: "true"
  name: prod-etcd
  namespace: default
spec:
  datastore: /Dark Systems Datacenter/datastore/vsanDatastore
  diskGiB: 25
  folder: /Dark Systems Datacenter/vm/EKS Anywhere
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /Dark Systems Datacenter/host/Cluster/Resources
  template: /Dark Systems Datacenter/vm/Templates/bottlerocket-vmware-k8s-1.22-x86_64-1.8.0-a6233c22
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ""

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: prod
  namespace: default
spec:
  datastore: /Dark Systems Datacenter/datastore/vsanDatastore
  diskGiB: 50
  folder: /Dark Systems Datacenter/vm/EKS Anywhere
  memoryMiB: 16384
  numCPUs: 16
  osFamily: bottlerocket
  resourcePool: /Dark Systems Datacenter/host/Cluster/Resources
  template: /Dark Systems Datacenter/vm/Templates/bottlerocket-vmware-k8s-1.22-x86_64-1.8.0-a6233c22
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ""

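For example, with an ISC dhcpd server (the DHCP server in use may differ; the MAC addresses and the second IP below are placeholders), the static mappings described above look roughly like this:

host prod-cp-1   { hardware ethernet 00:50:56:aa:bb:01; fixed-address 192.168.3.192; }
host prod-etcd-1 { hardware ethernet 00:50:56:aa:bb:02; fixed-address 192.168.3.193; }
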
echel0n, Aug 16 '22 16:08

Any update on this?

echel0n, Sep 13 '22 23:09

When this happens in our Ubuntu EKSA cluster, I just change the IP to match the new one in the static pod manifest "/etc/kubernetes/manifests/kube-apiserver.yaml" on the control plane node, then delete the pod, and that solves the problem.
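
Roughly, with the old and new addresses below as placeholders, that workaround looks like this (a sketch, assuming a kubeadm-style control plane node on Ubuntu):

# on the affected control plane node, replace the stale node IP in the static pod manifest
sudo sed -i 's/192.168.3.192/192.168.3.201/g' /etc/kubernetes/manifests/kube-apiserver.yaml
# kubelet normally re-creates the static pod when its manifest changes; if not, delete the
# mirror pod so it is restarted (the node name below is a placeholder)
kubectl -n kube-system delete pod kube-apiserver-<control-plane-node-name>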

danielem37, Oct 24 '22 06:10