[Bug] Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter

Open nathan-bowman opened this issue 1 year ago • 7 comments

What were you trying to accomplish?

eksctl upgrade nodegroup t2-medium-v1-28

What happened?

Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter

How to reproduce it?

Simply run eksctl upgrade nodegroup ...

Logs

2024-01-02 16:04:27 [ℹ]  updating nodegroup stack to a newer format before upgrading nodegroup version
2024-01-02 16:04:27 [ℹ]  updating nodegroup stack
2024-01-02 16:04:28 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211467" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:04:59 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211467" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:04:59 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:05:30 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:05:30 [ℹ]  upgrading nodegroup version
2024-01-02 16:05:30 [ℹ]  updating nodegroup stack
2024-01-02 16:05:30 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211530" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:01 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211530" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:32 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:07:15 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:09:00 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:09:45 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:11:31 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:12:34 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:14:26 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:16:23 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:16:54 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:17:24 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:17:57 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:18:44 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:20:27 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:22:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:23:38 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:24:20 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:25:40 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:26:25 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:27:19 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:28:52 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:30:44 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:31:55 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:32:41 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:33:43 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:35:35 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:37:02 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:37:46 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:38:47 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:39:54 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:40:39 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:41:20 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:42:41 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:43:12 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:44:45 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:46:27 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:48:24 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:49:28 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:50:29 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:50:32 [ℹ]  waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter

Anything else we need to know?

One important note: this problem is intermittent. It happens occasionally; most of the time the nodegroup updates fine.

If I check CloudFormation, it will say UPDATE_COMPLETE and even eksctl reports that the nodegroup is updated and active...

# eksctl get nodegroup --cluster backend-staging --name t2-medium-v1-28
CLUSTER                 NODEGROUP       STATUS  CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY        INSTANCE TYPE   IMAGE ID        ASG NAME                                                       TYPE
backend-staging    t2-medium-v1-28 ACTIVE  2023-11-06T20:03:26Z    3               5               3                       t2.medium       AL2_x86_64      eks-t2-medium-v1-28-a4c5d31d-ca98-ae07-c328-035fff4b462c       managed

Meanwhile I'm left with: waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"

OS:

# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"

eksctl installed with:

ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
mv /tmp/eksctl /usr/local/bin

Versions

# eksctl info
eksctl version: 0.165.0
kubectl version: v1.28.4
OS: linux

nathan-bowman avatar Jan 02 '24 17:01 nathan-bowman

Hello nathan-bowman :wave: Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website

github-actions[bot] avatar Jan 02 '24 17:01 github-actions[bot]

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 02 '24 01:02 github-actions[bot]

Not stale

nathan-bowman avatar Feb 02 '24 16:02 nathan-bowman

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 04 '24 01:03 github-actions[bot]

Not stale

nathan-bowman avatar Mar 05 '24 14:03 nathan-bowman

The problem appears to surface when AWS credentials expire while the upgrade is taking place.

yuxiang-zhang avatar Mar 20 '24 23:03 yuxiang-zhang

Yes, I tried to reproduce this, and a log line I added seems to confirm my theory:

2024-03-21 00:50:18 [ℹ]  waiting for CloudFormation stack "eksctl-x-nodegroup-ng-0"
2024-03-21 00:50:18 [!]  err: operation error CloudFormation: DescribeStacks, https response error StatusCode: 403, RequestID: -, api error ExpiredToken: The security token included in the request is expired

The SDK configures 403 errors as retryable. Similar issues were reported to the SDK team, e.g. https://github.com/aws/aws-sdk-go/issues/2389 and https://github.com/aws/aws-sdk-go/issues/4983#issuecomment-1724570871

Edit: there was a "fix" for STS (https://github.com/hashicorp/aws-sdk-go-base/pull/362), but here we are using the default retryer stackDeleteCompleteStateRetryable from CloudFormation instead of the standard retryer; see https://github.com/eksctl-io/eksctl/blob/76902cddd97a4e2d838158e6352addd95f7385b1/pkg/cfn/manager/waiters.go#L137-L141

I'm inclined to just catch the ExpiredToken error and abort the waiter.

yuxiang-zhang avatar Mar 21 '24 00:03 yuxiang-zhang