
Unable to create cluster in Docker or Nutanix

markoradisa opened this issue 10 months ago • 3 comments • Status: Open

What happened:

Today I was unable to create an EKS cluster either locally or on Nutanix. Inspecting the logs, I get the error below. It has been happening on both Nutanix and Docker (probably because the cluster is first built in Docker).

On each occasion it seems to have an issue creating or accessing the control plane.

Please keep in mind that I have previously created clusters with no issues. Today I tried several versions of EKS Anywhere, from the latest down to v0.19.0, and also several Docker versions.

Any advice would be great.

I also tried my home machine as well as my work machine to rule out any networking issues.

{"T":1741646300318754241,"M":"Executing command","cmd":"/usr/bin/docker exec -i eksa_1741646029888252550 kubectl get --ignore-not-found -o json --kubeconfig docker/generated/docker.kind.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default docker"}
{"T":1741646300466424594,"M":"Cluster generation and observedGeneration","Generation":1,"ObservedGeneration":1}
{"T":1741646300466484939,"M":"Error happened during retry","error":"cluster condition ControlPlaneReady is False: Control plane nodes not ready yet, 1 expected (0 ready)","retries":50}
{"T":1741646300466502600,"M":"Sleeping before next retry","time":"1s"}
{"T":1741646301466735454,"M":"Executing command","cmd":"/usr/bin/docker exec -i eksa_1741646029888252550 kubectl get --ignore-not-found -o json --kubeconfig docker/generated/docker.kind.kubeconfig Cluster.v1alph

As mentioned, I have previously created clusters with no issues. While having trouble creating the cluster on Nutanix, I tried Docker as the provisioner; both fail in the same way.

Docker version

Client:
 Version:           26.1.3
 API version:       1.45
 Go version:        go1.22.2
 Git commit:        26.1.3-0ubuntu1~24.04.1
 Built:             Mon Oct 14 14:29:26 2024
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          26.1.3
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.22.2
  Git commit:       26.1.3-0ubuntu1~24.04.1
  Built:            Mon Oct 14 14:29:26 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.24
  GitCommit:        
 runc:
  Version:          1.1.12-0ubuntu3.1
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        

markoradisa commented on Mar 10 '25

I went back to EKS Anywhere v0.18.4, which got a bit further: it creates the VMs in Nutanix and seems to create the Kubernetes cluster, but it doesn't appear to copy any of the EKS Anywhere objects from eksa-system.

Basically, the checkpoint file says everything is null...

2025-03-12T19:20:37.635Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1741806693342302661 kubectl apply -f - --kubeconfig mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig"}

2025-03-12T19:20:38.121Z        V5      Retry execution successful      {"retries": 1, "duration": "486.076278ms"}

2025-03-12T19:20:38.121Z        V0      ⏳ Collecting support bundle from cluster, this can take a while        {"cluster": "mar-lab-eks-1", "bundle": "mar-lab-eks-1/generated/mar-lab-eks-1-2025-03-12T19:20:37Z-bundle.yaml", "since": "2025-03-12T16:20:37.454Z", "kubeconfig": "mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig"}

2025-03-12T19:20:38.121Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1741806693342302661 support-bundle mar-lab-eks-1/generated/mar-lab-eks-1-2025-03-12T19:20:37Z-bundle.yaml --kubeconfig mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig --interactive=false --since-time 2025-03-12T16:20:37.454335648Z"}

2025-03-12T19:23:34.551Z        V9      docker  {"stderr": "Error: failed to run collect and analyze process: failed to run collectors: failed to redact in cluster collector results: failed to decompress file: unexpected EOF\n"}

2025-03-12T19:23:34.551Z        V5      Error collecting and saving logs        {"error": "failed to Collect support bundle: executing support-bundle: Error: failed to run collect and analyze process: failed to run collectors: failed to redact in cluster collector results: failed to decompress file: unexpected EOF\n"}

2025-03-12T19:23:34.551Z        V4      Task finished   {"task_name": "collect-cluster-diagnostics", "duration": "3m12.060411663s"}

2025-03-12T19:23:34.551Z        V4      ----------------------------------

2025-03-12T19:23:34.551Z        V4      Saving checkpoint       {"file": "mar-lab-eks-1-checkpoint.yaml"}

2025-03-12T19:23:34.551Z        V4      Tasks completed {"duration": "11m50.184723194s"}

2025-03-12T19:23:34.551Z        V3      Cleaning up long running container      {"name": "eksa_1741806693342302661"}

2025-03-12T19:23:34.551Z        V6      Executing command       {"cmd": "/usr/bin/docker rm -f -v eksa_1741806693342302661"}

Error: moving CAPI management from source to target: failed moving management cluster: Performing move...

Discovering Cluster API objects

Moving Cluster API objects Clusters=1

Moving Cluster API objects ClusterClasses=0

Creating objects in the target cluster

Error: [action failed after 10 attempts: error creating "bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfigTemplate" eksa-system/mar-lab-eks-1-md-0-template-1741806828401: Internal error occurred: error resolving resource, action failed after 10 attempts: error creating "controlplane.cluster.x-k8s.io/v1beta1, Kind=KubeadmControlPlane" eksa-system/mar-lab-eks-1: Internal error occurred: error resolving resource]
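
From what I can tell, the "Internal error occurred: error resolving resource" part usually means the API server on the target cluster could not reach the webhooks registered for the CAPI resources. A rough check, just a sketch assuming the standard CAPI namespaces and the workload kubeconfig from the log above, would be:

kubectl --kubeconfig mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig get pods -n capi-system
kubectl --kubeconfig mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig get pods -n capi-kubeadm-bootstrap-system
kubectl --kubeconfig mar-lab-eks-1/mar-lab-eks-1-eks-a-cluster.kubeconfig get pods -n capi-kubeadm-control-plane-system

If any of those controllers are not Running, the move keeps failing with the same resolving error.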

Any help would really be appreciated.

markoradisa commented on Mar 13 '25

By any chance, at least for the local cluster, are you using the generated template with the default empty Cilium config?

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster-name
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12

I learned I needed to set the Cilium routing mode to direct. After this, the local cluster provisioned successfully and was no longer stuck in the "Control plane nodes not ready yet" phase.

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster-name
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
    cniConfig:
      cilium:
        routingMode: "direct"
        ipv4NativeRoutingCIDR: 192.168.0.0/16
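
If it helps, a quick way to confirm the CNI actually came up after the change (just a sketch, assuming the default Cilium install in kube-system and the kubeconfig path EKS Anywhere writes for the cluster) is:

kubectl --kubeconfig my-cluster-name/my-cluster-name-eks-a-cluster.kubeconfig get pods -n kube-system -l k8s-app=cilium

Once those pods are Running, the control plane nodes should report Ready and the ControlPlaneReady condition should stop failing.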

jollygoose commented on Jun 04 '25

Hello @markoradisa, I would advise you to check whether the inotify limits are OK. I recently solved a similar problem with the following approach.

The machine from the worker node group configuration was failing:

kubectl get machines -n eksa-system -o wide
# mgmt-md-0-8jspn-zgjx7 is stuck in the Provisioning state indefinitely

docker logs mgmt-md-0-8jspn-zgjx7

...
Welcome to Amazon Linux 2023.6.20241212!

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

...

sysctl fs.inotify.max_user_watches
fs.inotify.max_user_watches = 148727

sysctl fs.inotify.max_user_instances 
fs.inotify.max_user_instances = 128

I solved this by increasing fs.inotify.max_user_instances from 128 to 512, as per the documentation page "Pod errors due to too many open files":

sudo sysctl fs.inotify.max_user_instances=512

# To be sure, it is also better to do the following
sudo sysctl fs.inotify.max_user_watches=524288

The commands above do not survive a system reboot; see the documentation for how to increase the limits persistently.
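
For example, on a systemd-based host the values can be persisted with a sysctl drop-in file (the file name below is just an example):

cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances = 512
fs.inotify.max_user_watches = 524288
EOF
sudo sysctl --system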

AndreaTosti commented on Jul 16 '25