eksctl
[Bug] Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter
What were you trying to accomplish?
eksctl upgrade nodegroup t2-medium-v1-28
What happened?
Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter
How to reproduce it?
Simply run eksctl upgrade nodegroup ...
Logs
2024-01-02 16:04:27 [ℹ] updating nodegroup stack to a newer format before upgrading nodegroup version
2024-01-02 16:04:27 [ℹ] updating nodegroup stack
2024-01-02 16:04:28 [ℹ] waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211467" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:04:59 [ℹ] waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211467" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:04:59 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:05:30 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:05:30 [ℹ] upgrading nodegroup version
2024-01-02 16:05:30 [ℹ] updating nodegroup stack
2024-01-02 16:05:30 [ℹ] waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211530" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:01 [ℹ] waiting for CloudFormation changeset "eksctl-update-nodegroup-1704211530" for stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:02 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:06:32 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:07:15 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:09:00 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:09:45 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:11:31 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:12:34 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:14:26 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:16:23 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:16:54 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:17:24 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:17:57 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:18:44 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:20:27 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:22:02 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:23:38 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:24:20 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:25:40 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:26:25 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:27:19 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:28:52 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:30:44 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:31:55 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:32:41 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:33:43 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:35:35 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:37:02 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:37:46 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:38:47 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:39:54 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:40:39 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:41:20 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:42:41 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:43:12 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:44:45 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:46:27 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:48:24 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:49:28 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:50:29 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
2024-01-02 16:50:32 [ℹ] waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
Error: error updating nodegroup stack: exceeded max wait time for StackUpdateComplete waiter
Anything else we need to know?
Note that this problem is intermittent: it happens occasionally, but most of the time the nodegroup updates fine.
When it does happen, CloudFormation shows UPDATE_COMPLETE, and even eksctl reports that the nodegroup is updated and active:
# eksctl get nodegroup --cluster backend-staging --name t2-medium-v1-28
CLUSTER          NODEGROUP        STATUS  CREATED               MIN SIZE  MAX SIZE  DESIRED CAPACITY  INSTANCE TYPE  IMAGE ID    ASG NAME                                                  TYPE
backend-staging  t2-medium-v1-28  ACTIVE  2023-11-06T20:03:26Z  3         5         3                 t2.medium      AL2_x86_64  eks-t2-medium-v1-28-a4c5d31d-ca98-ae07-c328-035fff4b462c  managed
Meanwhile I'm left with:
waiting for CloudFormation stack "eksctl-backend-staging-nodegroup-t2-medium-v1-28"
OS:
# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
eksctl installed with:
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
mv /tmp/eksctl /usr/local/bin
Versions
# eksctl info
eksctl version: 0.165.0
kubectl version: v1.28.4
OS: linux
Hello nathan-bowman :wave: Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Not stale
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Not stale
The problem appears to surface when AWS credentials expire while the upgrade is taking place.
Yes, I tried to reproduce this, and a log line I added seems to confirm my theory:
2024-03-21 00:50:18 [ℹ] waiting for CloudFormation stack "eksctl-x-nodegroup-ng-0"
2024-03-21 00:50:18 [!] err: operation error CloudFormation: DescribeStacks, https response error StatusCode: 403, RequestID: -, api error ExpiredToken: The security token included in the request is expired
The SDK configures 403 errors as retryable. Similar issues were reported to the SDK team, e.g. https://github.com/aws/aws-sdk-go/issues/2389 and https://github.com/aws/aws-sdk-go/issues/4983#issuecomment-1724570871
Edit: there was a "fix" for STS (https://github.com/hashicorp/aws-sdk-go-base/pull/362), but here we are using the default retryer stackDeleteCompleteStateRetryable from CloudFormation instead of the standard retryer:
https://github.com/eksctl-io/eksctl/blob/76902cddd97a4e2d838158e6352addd95f7385b1/pkg/cfn/manager/waiters.go#L137-L141
I'm inclined to just catch the ExpiredToken error and abort the waiter.