cluster-api-provider-aws
MachinePool and MachineDeployment added to an EKS cluster can sometimes get stuck in the ScalingUp state
/kind bug
What steps did you take and what happened:
When creating EKS clusters through CAPI with AWSManagedControlPlane as the control plane and MachinePool/MachineDeployment as the node group, we observed that Machines sometimes get stuck in the "Provisioned" state. This causes the corresponding MachinePool and MachineDeployment to remain in the "ScalingUp" state. Upon debugging, we found the following error in the cloud-init log on the VMs created for the MachineDeployment/MachinePool:
Cloud-init v. 19.3-45.amzn2 running 'modules:final' at Thu, 08 Sep 2022 18:30:04 +0000. Up 14.63 seconds.
Waiter ClusterActive failed: Max attempts exceeded
Exited with error on line 369
Sep 08 18:49:40 cloud-init[3349]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [255]
Sep 08 18:49:40 cloud-init[3349]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Sep 08 18:49:40 cloud-init[3349]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 19.3-45.amzn2 finished at Thu, 08 Sep 2022 18:49:40 +0000. Datasource DataSourceEc2. Up 1190.92 seconds
The failing command from /etc/eks/bootstrap.sh:
aws eks wait cluster-active \
  --region=${AWS_DEFAULT_REGION} \
  --name=${CLUSTER_NAME}
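For reference, the status that this waiter polls can also be checked directly with the AWS CLI. This is only a debugging sketch reusing the script's environment variables; while the control plane is still coming up it will presumably report CREATING rather than ACTIVE:
# Check the EKS control plane status directly (debugging sketch, same variables as the script)
aws eks describe-cluster \
  --region=${AWS_DEFAULT_REGION} \
  --name=${CLUSTER_NAME} \
  --query 'cluster.status' \
  --output text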
Increasing the number of retry attempts through EKSConfig/EKSConfigTemplate does not help either. This value is used in the above script to implement retry logic, but the cloud-init logs show that no retry ever happens. This is most likely because the command fails with a non-zero exit code.
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: EKSConfig
metadata:
  name: test-pma-eks-2-mp-1
  namespace: default
spec:
  apiRetryAttempts: 15
Once a machine pool/deployment gets stuck in the "ScalingUp" state in this manner, the only way to resolve the issue is to scale it down to 0 and then back up to the desired count after the cluster/control plane is ready, for example as sketched below.
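A sketch of that workaround, assuming a MachineDeployment that exposes the scale subresource; the resource name and replica count are placeholders:
# Scale the stuck MachineDeployment down (placeholder name)
kubectl scale machinedeployment <md-name> --replicas=0
# Wait until the EKS control plane reports ACTIVE, then restore the desired count
kubectl scale machinedeployment <md-name> --replicas=<desired-count>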
I have attached the YAML used to provision the cluster: test-eks-from-ui.yaml.txt
What did you expect to happen:
Machine pools/deployments created along with an EKS cluster should go into the "Running" state automatically and be added to the EKS cluster.
Anything else you would like to add:
I suspect this happens because the control plane is not ready when the nodes are created, which causes the bootstrap script to fail.
Additionally, the retry logic in bootstrap.sh should account for the aws eks wait ... command failing and retry it. Currently there is set -o errexit at the start of the script, which aborts the script as soon as any command fails, so the wait is never retried. A possible retry wrapper is sketched below.
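For illustration only, and not the current bootstrap.sh code: a wrapper along these lines would be compatible with set -o errexit, because a command tested in an if condition does not trigger errexit. The variable name API_RETRY_ATTEMPTS, its default, and the sleep interval are assumptions:
# Sketch of a retry wrapper around the waiter (variable name, attempt count, and sleep interval are assumed)
for attempt in $(seq 1 "${API_RETRY_ATTEMPTS:-15}"); do
  if aws eks wait cluster-active \
      --region=${AWS_DEFAULT_REGION} \
      --name=${CLUSTER_NAME}; then
    break
  fi
  echo "Cluster not active yet (attempt ${attempt}); retrying..."
  sleep 30
done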
Environment:
- Cluster-api-provider-aws version: v1.5.0 and v1.4.1
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.12", GitCommit:"b058e1760c79f46a834ba59bd7a3486ecf28237d", GitTreeState:"clean", BuildDate:"2022-07-13T14:53:39Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release):
@pacharya-pf9: This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@Ankitasw I looked through the changes in https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3743 but can you elaborate a bit more on how that fixes this issue?
I have not followed this issue. cc @richcase if you know how this is related to updating the wrong providerIDList.
@Ankitasw: Reopened this issue.
In response to this:
/reopen
From triage:
- We need to get a repro to see what's going on
- We need to check retries with the machine reconcilers
/triage needs-information
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/triage accepted /priority backlog
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned