
EKS v0.19.5 Creating cluster in Docker fails at some point

Open abregar opened this issue 9 months ago • 3 comments

Filing this as a problem report. I tried to initialize a dev cluster on macOS (Sonoma 14.4.1) with Docker Desktop v4.30.0, as the documentation suggests, with a higher verbosity level:

eksctl anywhere create cluster -f $CLUSTER_NAME.yaml -v 9

I am using the latest release, as shown in the log:

Initializing long running container     {"name": "eksa_1715243480381280000", "image": "public.ecr.aws/eks-anywhere/cli-tools:v0.19.5-eks-a-65"}

Initialization goes well: the containers for the control plane, load balancer, etcd, etc. are created successfully. But the creation process then stops at this point:

2024-05-09T10:52:50.466+0200    V1      cleaning up temporary namespace  for diagnostic collectors      {"namespace": "eksa-diagnostics"}
2024-05-09T10:52:50.466+0200    V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2024-05-09T10:52:50.466+0200    V6      Executing command       {"cmd": "/usr/local/bin/docker exec -i eksa_1715244428714146000 kubectl delete namespace eksa-diagnostics --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig"}
2024-05-09T10:52:55.641+0200    V5      Retry execution successful      {"retries": 1, "duration": "5.175007875s"}
2024-05-09T10:52:55.642+0200    V4      Task finished   {"task_name": "collect-cluster-diagnostics", "duration": "17.227805209s"}
2024-05-09T10:52:55.642+0200    V4      ----------------------------------
2024-05-09T10:52:55.642+0200    V4      Saving checkpoint       {"file": "mgmt-checkpoint.yaml"}
2024-05-09T10:52:55.643+0200    V4      Tasks completed {"duration": "5m38.393764542s"}
2024-05-09T10:52:55.643+0200    V3      Cleaning up long running container      {"name": "eksa_1715244428714146000"}
2024-05-09T10:52:55.643+0200    V6      Executing command       {"cmd": "/usr/local/bin/docker rm -f -v eksa_1715244428714146000"}
Error: creating namespace eksa-system: The connection to the server localhost:8080 was refused - did you specify the right host or port?

To me, it looks like the temporary container is removed too early, and the script then does not handle the missing kubeconfig.
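A quick way to check this hypothesis after the failure (a sketch; the container name below is the one from my log, yours will differ):

```shell
# If the CLI tools container is already gone, any follow-up kubectl call
# executed through it has to fail.
docker ps -a --filter "name=eksa_" --format "{{.Names}}\t{{.Status}}"

# The kubeconfig the failing command referenced; confirm it was written:
ls -l mgmt/mgmt-eks-a-cluster.kubeconfig
```

In my case `docker ps -a` shows no matching container, which is consistent with the "Cleaning up long running container" line appearing right before the error.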

So, my questions: is this considered a bug, is there a quick workaround, and is it possible to resume the cluster creation procedure from the failing point?
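Regarding the last question, the log shows a checkpoint being saved (mgmt-checkpoint.yaml). My assumption, which I have not verified, is that re-running the same command in the same directory should skip the completed tasks; if that does not work, a full cleanup of the bootstrap cluster and checkpoint before retrying might (destructive, so only for a throwaway dev cluster):

```shell
# Assumption: the saved checkpoint lets a re-run resume from the failed task.
CHECKPOINT_ENABLED=true eksctl anywhere create cluster -f $CLUSTER_NAME.yaml -v 9

# Otherwise, start over: remove the leftover kind bootstrap cluster and the
# checkpoint file, then re-run the create command.
kind get clusters                  # list leftover bootstrap clusters
kind delete cluster --name <name>  # delete the one matching your cluster
rm -f mgmt-checkpoint.yaml
```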

abregar avatar May 09 '24 09:05 abregar