
MachinePool and MachineDeployment added to an EKS cluster can sometimes get stuck in the ScalingUp state

pacharya-pf9 opened this issue 3 years ago • 9 comments

/kind bug

What steps did you take and what happened:

When creating EKS clusters through CAPI with AWSManagedControlPlane as the control plane and a MachinePool/MachineDeployment as the node group, we observed that machines sometimes get stuck in the "Provisioned" state. This causes the corresponding machine pool and machine deployment to remain in the "ScalingUp" state. Upon debugging, we found the following error in the cloud-init log on the VMs created for the machine deployment/pool:

Cloud-init v. 19.3-45.amzn2 running 'modules:final' at Thu, 08 Sep 2022 18:30:04 +0000. Up 14.63 seconds.

Waiter ClusterActive failed: Max attempts exceeded
Exited with error on line 369
Sep 08 18:49:40 cloud-init[3349]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [255]
Sep 08 18:49:40 cloud-init[3349]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Sep 08 18:49:40 cloud-init[3349]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 19.3-45.amzn2 finished at Thu, 08 Sep 2022 18:49:40 +0000. Datasource DataSourceEc2.  Up 1190.92 seconds

The command that fails in /etc/eks/bootstrap.sh:

        aws eks wait cluster-active \
            --region=${AWS_DEFAULT_REGION} \
            --name=${CLUSTER_NAME}

Increasing the number of retry attempts through EKSConfig/EKSConfigTemplate does not help either. That value is used in the above script to implement retry logic, but the cloud-init logs show that no retry ever happens, likely because the command fails with a non-zero exit code.

apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: EKSConfig
metadata:
  name: test-pma-eks-2-mp-1
  namespace: default
spec:
  apiRetryAttempts: 15

Once a machine pool/deployment gets stuck in the "ScalingUp" state this way, the only way to resolve the issue is to scale it down to 0 and then back up to the desired count once the cluster / control plane is ready.
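
For reference, a minimal sketch of that workaround (the resource name, cluster name, and replica count below are placeholders, not values from the attached manifest); MachineDeployments and MachinePools expose the scale subresource, so kubectl scale can be used:

    # Placeholder names/values for illustration only
    kubectl scale machinedeployment <machinedeployment-name> --replicas=0
    # Wait until the EKS control plane reports ACTIVE, for example:
    aws eks describe-cluster --name <eks-cluster-name> --query cluster.status
    # Then scale back up to the desired count
    kubectl scale machinedeployment <machinedeployment-name> --replicas=<desired-count>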

I have attached the YAML used to provision the cluster: test-eks-from-ui.yaml.txt

What did you expect to happen:

Machine pools/deployments created along with the EKS cluster should go into the running state automatically and be added to the EKS cluster.

Anything else you would like to add:

I suspect this happens because the control plane is not ready when the nodes are created, which causes the bootstrap script to fail.

Additionally, the retry logic in bootstrap.sh should account for the aws eks wait ... command failing and retry it. Currently, set -o errexit at the start of the script causes the script to exit as soon as any command fails, so the failing command is never retried.
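
As a rough illustration only (this is not the actual bootstrap.sh code, and the API_RETRY_ATTEMPTS variable name is an assumption about how apiRetryAttempts is surfaced to the script), the wait could be wrapped so a failure does not trip errexit and can be retried:

    # Sketch only; variable names are assumptions, not the real bootstrap.sh contents.
    # Running the command as an if-condition keeps set -o errexit from aborting the script.
    for attempt in $(seq 1 "${API_RETRY_ATTEMPTS:-10}"); do
        if aws eks wait cluster-active \
            --region="${AWS_DEFAULT_REGION}" \
            --name="${CLUSTER_NAME}"; then
            break
        fi
        echo "Cluster not active yet (attempt ${attempt}/${API_RETRY_ATTEMPTS:-10}); retrying..."
        sleep 30
    done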

Environment:

  • Cluster-api-provider-aws version: v1.5.0 and v1.4.1
  • Kubernetes version: (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.12", GitCommit:"b058e1760c79f46a834ba59bd7a3486ecf28237d", GitTreeState:"clean", BuildDate:"2022-07-13T14:53:39Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release):

pacharya-pf9 avatar Sep 09 '22 21:09 pacharya-pf9

@pacharya-pf9: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 09 '22 21:09 k8s-ci-robot

@Ankitasw I looked through the changes in https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3743, but can you elaborate a bit more on how that fixes this issue?

pacharya-pf9 avatar Sep 28 '22 16:09 pacharya-pf9

I have not followed this issue. cc @richcase if you know how this is related to updating the wrong providerIDList.

Ankitasw avatar Sep 28 '22 17:09 Ankitasw

@Ankitasw: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 04 '22 08:10 k8s-ci-robot

From triage:

  • We need to get a repro to see what's going on
  • We need to check retries with the machine reconcilers

/triage needs-information

richardcase avatar Oct 17 '22 16:10 richardcase

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 15 '23 17:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 14 '23 17:02 k8s-triage-robot

/triage accepted
/priority backlog

dlipovetsky avatar Mar 06 '23 17:03 dlipovetsky

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Mar 05 '24 17:03 k8s-triage-robot

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 05 '24 17:03 k8s-ci-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Apr 04 '24 18:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 04 '24 18:04 k8s-ci-robot