Unable to create cluster in Docker or Nutanix
What happened:
Today I am unable to create an EKS Anywhere cluster either locally or on Nutanix. Inspecting the logs, I get the error below. It has been happening on both Nutanix and Docker (probably because the bootstrap cluster is first built in Docker).
On each occasion it seems to have an issue creating or accessing the control plane.
Please keep in mind that I have previously created clusters with no issues. Today I have tried several versions of EKS Anywhere, from the latest back to v0.19.0, and also several Docker versions.
Any advice would be great.
I also tried my home machine as well as my work machine to rule out any networking issues.
{"T":1741646300318754241,"M":"Executing command","cmd":"/usr/bin/docker exec -i eksa_1741646029888252550 kubectl get --ignore-not-found -o json --kubeconfig docker/generated/docker.kind.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default docker"}
{"T":1741646300466424594,"M":"Cluster generation and observedGeneration","Generation":1,"ObservedGeneration":1}
{"T":1741646300466484939,"M":"Error happened during retry","error":"cluster condition ControlPlaneReady is False: Control plane nodes not ready yet, 1 expected (0 ready)","retries":50}
{"T":1741646300466502600,"M":"Sleeping before next retry","time":"1s"}
{"T":1741646301466735454,"M":"Executing command","cmd":"/usr/bin/docker exec -i eksa_1741646029888252550 kubectl get --ignore-not-found -o json --kubeconfig docker/generated/docker.kind.kubeconfig Cluster.v1alph
As mentioned, I have previously created clusters with no issues. While having problems creating the cluster on Nutanix, I tried using Docker as the provider; both are experiencing issues.
Docker version
Client:
  Version:        26.1.3
  API version:    1.45
  Go version:     go1.22.2
  Git commit:     26.1.3-0ubuntu1~24.04.1
  Built:          Mon Oct 14 14:29:26 2024
  OS/Arch:        linux/amd64
  Context:        default

Server:
  Engine:
    Version:      26.1.3
    API version:  1.45 (minimum version 1.24)
    Go version:   go1.22.2
    Git commit:   26.1.3-0ubuntu1~24.04.1
    Built:        Mon Oct 14 14:29:26 2024
    OS/Arch:      linux/amd64
    Experimental: false
  containerd:
    Version:      1.7.24
    GitCommit:
  runc:
    Version:      1.1.12-0ubuntu3.1
    GitCommit:
  docker-init:
    Version:      0.19.0
    GitCommit:
I went back to EKS Anywhere v0.18.4, which got a bit further. It creates the VMs in Nutanix and seems to create the Kubernetes cluster, but it doesn't seem to copy any of the EKS Anywhere objects from eksa-system.
Basically, the checkpoint file says everything is null...
2025-03-12T19:20:37.635Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1741806693342302661 kubectl apply -f - --kubeconfig mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig"}
2025-03-12T19:20:38.121Z V5 Retry execution successful {"retries": 1, "duration": "486.076278ms"}
2025-03-12T19:20:38.121Z V0 ⏳ Collecting support bundle from cluster, this can take a while {"cluster": "mar-lab-eks-1", "bundle": "mar-lab-eks-1/generated/mar-lab-eks-1-2025-03-12T19:20:37Z-bundle.yaml", "since": "2025-03-12T16:20:37.454Z", "kubeconfig": "mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig"}
2025-03-12T19:20:38.121Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1741806693342302661 support-bundle mar-lab-eks-1/generated/mar-lab-eks-1-2025-03-12T19:20:37Z-bundle.yaml --kubeconfig mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig --interactive=false --since-time 2025-03-12T16:20:37.454335648Z"}
2025-03-12T19:23:34.551Z V9 docker {"stderr": "Error: failed to run collect and analyze process: failed to run collectors: failed to redact in cluster collector results: failed to decompress file: unexpected EOF\n"}
2025-03-12T19:23:34.551Z V5 Error collecting and saving logs {"error": "failed to Collect support bundle: executing support-bundle: Error: failed to run collect and analyze process: failed to run collectors: failed to redact in cluster collector results: failed to decompress file: unexpected EOF\n"}
2025-03-12T19:23:34.551Z V4 Task finished {"task_name": "collect-cluster-diagnostics", "duration": "3m12.060411663s"}
2025-03-12T19:23:34.551Z V4 ----------------------------------
2025-03-12T19:23:34.551Z V4 Saving checkpoint {"file": "mar-lab-eks-1-checkpoint.yaml"}
2025-03-12T19:23:34.551Z V4 Tasks completed {"duration": "11m50.184723194s"}
2025-03-12T19:23:34.551Z V3 Cleaning up long running container {"name": "eksa_1741806693342302661"}
2025-03-12T19:23:34.551Z V6 Executing command {"cmd": "/usr/bin/docker rm -f -v eksa_1741806693342302661"}
Error: moving CAPI management from source to target: failed moving management cluster: Performing move...
Discovering Cluster API objects
Moving Cluster API objects Clusters=1
Moving Cluster API objects ClusterClasses=0
Creating objects in the target cluster
Error: [action failed after 10 attempts: error creating "bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfigTemplate" eksa-system/mar-lab-eks-1-md-0-template-1741806828401: Internal error occurred: error resolving resource, action failed after 10 attempts: error creating "controlplane.cluster.x-k8s.io/v1beta1, Kind=KubeadmControlPlane" eksa-system/mar-lab-eks-1: Internal error occurred: error resolving resource]
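I am not sure whether it is the root cause, but that "error resolving resource" from the move step looks like the target cluster's CAPI webhooks or CRDs were not ready when the objects were created. A rough sketch of the checks I am looking at against the target cluster (kubeconfig path taken from the logs above; the exact namespaces may vary):
KUBECONFIG=mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig
# Are the Cluster API controllers running on the target cluster?
kubectl --kubeconfig "$KUBECONFIG" get pods -A | grep -E 'capi-|capx-'
# Are the CAPI CRDs and webhook configurations installed?
kubectl --kubeconfig "$KUBECONFIG" get crds | grep cluster.x-k8s.io
kubectl --kubeconfig "$KUBECONFIG" get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i cluster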
Any help would really be appreciated.
By any chance, at least for the local cluster, are you using the generated template with no Cilium config?
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster-name
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
I learned I needed to set the routing mode to direct. After this, the local cluster was able to provision successfully and was no longer stuck in the "Control plane nodes not ready yet" phase.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster-name
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
    cniConfig:
      cilium:
        routingMode: "direct"
        ipv4NativeRoutingCIDR: 192.168.0.0/16
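If it helps, the effective routing mode can be double-checked on the running cluster. Treat this as a sketch, since the exact config keys and the in-pod CLI name vary between Cilium versions:
# Rendered Cilium configuration (key is routing-mode or tunnel depending on version)
kubectl -n kube-system get configmap cilium-config -o yaml | grep -Ei 'routing-mode|tunnel|native-routing'
# The agent's own view (in newer Cilium releases the in-pod binary is cilium-dbg)
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i routing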
Hello @markoradisa, I would advise you to check whether the inotify limits are OK. I recently solved a similar problem with the following approach:
The machine for the worker node group configuration was failing:
kubectl get machines -n eksa-system -o wide
# mgmt-md-0-8jspn-zgjx7 is stuck in the Provisioning state indefinitely
docker logs mgmt-md-0-8jspn-zgjx7
...
Welcome to Amazon Linux 2023.6.20241212!
Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
...
sysctl fs.inotify.max_user_watches
fs.inotify.max_user_watches = 148727
sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 128
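To see which processes are actually consuming the inotify instances, a common approach (sketch; assumes a Linux host with procfs) is to count the inotify file descriptors per PID:
# Count inotify instances per process by scanning /proc for inotify anon inodes
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 | sort | uniq -c | sort -rn | head
# Then map the busiest PIDs back to process names, e.g.
ps -o pid,comm -p <pid>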
I solved this by increasing fs.inotify.max_user_instances from 128 to 512, as per the documentation ("Pod errors due to too many open files"):
sudo sysctl fs.inotify.max_user_instances=512
# To be sure, it is also better to do the following
sudo sysctl fs.inotify.max_user_watches=524288
The commands above do not survive system reboots; see the documentation on how to increase these values persistently.
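For reference, one common way to persist the values is a sysctl drop-in file (the file name below is just an example):
# Persist the limits across reboots (99-inotify.conf is an arbitrary name)
cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances = 512
fs.inotify.max_user_watches = 524288
EOF
# Reload all sysctl configuration files without rebooting
sudo sysctl --system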