eks-anywhere
1st CP node always fails `MachineHealthCheck` when creating a Bottlerocket workload cluster
What happened: When creating a Bottlerocket EKS Anywhere workload cluster with 3 control plane nodes from an existing management cluster, the first control plane node always fails its `MachineHealthCheck`, gets deleted, and is re-provisioned. Once the node is re-provisioned, the cluster functions properly, but it is still strange that the first control plane node always fails. It is also strange that this happens right around the time EKS-A reports that the cluster was successfully created. We tried this several times and saw the same behavior every time.
How to reproduce it (as minimally and precisely as possible): Create a Bottlerocket management cluster. Using this management cluster, create a Bottlerocket workload cluster with 3 control plane nodes and wait until the CLI finishes. Once it is done, check the machine status on the management cluster. One machine should show as deleting. I also had AWS IAM Authenticator deployed on both the management and workload clusters for this test. I'm not sure whether this matters, but wanted to mention it.
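For anyone trying to reproduce this, the machine status check can be scripted. The following is only a rough sketch, not something taken from the EKS Anywhere tooling: it assumes `kubectl` is on the PATH, that the Cluster API `Machine` objects on the management cluster report a `status.phase` of `Deleting` while they are being removed, and the kubeconfig path in the usage comment is a hypothetical placeholder.

```python
import json
import subprocess
from typing import List


def machines_in_phase(machine_list: dict, phase: str = "Deleting") -> List[str]:
    """Return "namespace/name" for every Machine whose status.phase equals `phase`."""
    matches = []
    for item in machine_list.get("items", []):
        # Machines that have no status yet are skipped rather than raising.
        if item.get("status", {}).get("phase") == phase:
            meta = item.get("metadata", {})
            matches.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return matches


def fetch_machines(kubeconfig: str) -> dict:
    """Run `kubectl get machines -A -o json` against the management cluster."""
    result = subprocess.run(
        ["kubectl", "--kubeconfig", kubeconfig, "get", "machines", "-A", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Usage (hypothetical kubeconfig path, adjust to your environment):
#   machines = fetch_machines("mgmt/mgmt-eks-a-cluster.kubeconfig")
#   print(machines_in_phase(machines))
```

Running this right after the CLI finishes, and again a few minutes later, should show whether the first control plane machine has moved into the deleting phase.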
Environment:
- EKS Anywhere Release: v0.11.1
- EKS Distro Release: v1.23