[Help] Unable to upgrade a managed nodegroup

ybykov-a9s opened this issue on Sep 27, 2024

Hello!

I can't upgrade a managed nodegroup using eksctl.

The following document was used for the procedure: https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html#mng-update

Steps to reproduce:

Create a cluster using the following manifest:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: yby-test
  region: eu-central-1
  version: "1.28"
managedNodeGroups:
  - name: mng-medium
    instanceType: t3a.medium
    desiredCapacity: 2
    minSize: 1
    maxSize: 2
    volumeSize: 10
    iam:
      withAddonPolicies:
        ebs: true

It gets created successfully.

Next, I upgrade the control plane's Kubernetes version using the following command:

eksctl upgrade cluster --name yby-test --region eu-central-1 --approve

Everything works fine:

2024-09-27 15:05:31 [ℹ]  will upgrade cluster "yby-test" control plane from current version "1.28" to "1.29"
2024-09-27 15:14:54 [✔]  cluster "yby-test" control plane has been upgraded to version "1.29"
2024-09-27 15:14:54 [ℹ]  you will need to follow the upgrade procedure for all of nodegroups and add-ons
2024-09-27 15:14:55 [ℹ]  re-building cluster stack "eksctl-yby-test-cluster"
2024-09-27 15:14:55 [✔]  all resources in cluster stack "eksctl-yby-test-cluster" are up-to-date
2024-09-27 15:14:55 [ℹ]  checking security group configuration for all nodegroups
2024-09-27 15:14:55 [ℹ]  all nodegroups have up-to-date cloudformation templates

Then I try to upgrade the nodegroup to the target version using:

eksctl upgrade nodegroup --cluster yby-test --region eu-central-1 --name mng-medium --kubernetes-version=1.29

Here is the log:

2024-09-27 15:17:45 [ℹ]  will upgrade nodes to release version: 1.29.8-20240917
2024-09-27 15:17:45 [ℹ]  upgrading nodegroup version
2024-09-27 15:17:45 [ℹ]  updating nodegroup stack
2024-09-27 15:17:46 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1727443065" for stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:18:16 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1727443065" for stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:18:16 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:18:46 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:19:28 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:20:28 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:21:36 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:23:25 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:24:58 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:25:34 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:26:44 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:28:41 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:30:12 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:31:48 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
2024-09-27 15:33:39 [ℹ]  waiting for CloudFormation stack "eksctl-yby-test-nodegroup-mng-medium"
Error: error updating nodegroup stack: waiter state transitioned to Failure

If I check the CloudFormation console, I see the following event:

ManagedNodeGroup
Resource handler returned message: "Requested release version 1.29.8-20240917 is not valid for kubernetes version 1.28. (Service: Eks, Status Code: 400, Request ID: 00c5f96d-c686-42a6-98e8-06abde8621d6)" (RequestToken: 38c80d12-e9e8-12b7-ab49-6c7cf2a65b6c, HandlerErrorCode: InvalidRequest)

If I try to upgrade the node group using the AWS web console, everything works fine, but without any changes in the CloudFormation logs, so I suppose the console doesn't use CloudFormation.

eksctl version
0.190.0-dev+3fccc8ed8.2024-09-04T12:58:57Z

What help do you need?

Please point out if I misunderstood the documentation or if it's a bug. Maybe there are other actions that have to be done.

Let me know if I should provide more information or run more tests.

Thanks in advance.

-- Eugene Bykov

ybykov-a9s avatar Sep 27 '24 14:09 ybykov-a9s

Hello ybykov-a9s :wave: Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website.

github-actions[bot] avatar Sep 27 '24 14:09 github-actions[bot]

I am also facing this issue. When I checked CloudFormation, it showed: Resource handler returned message: "Volume of size 10GB is smaller than snapshot 'snap-0145xxxxxx10a66e4', expect size >= 20GB"

But I can do that (less than 20 GB) in another account. They are both in the same region. The only difference I can think of is the Kubernetes cluster version: I can create a node with a 10 GB volume in 1.29, but not in 1.30.

The eksctl version I am using is 0.184.

TreeKat71 avatar Oct 04 '24 16:10 TreeKat71

Any update on this? I'm facing the same issue: Resource handler returned message: "Requested release version 1.31.0-20241024 is not valid for kubernetes version 1.30. (Service: Eks, Status Code: 400, Request ID: 15e2fb73-4134-4763-94d4-6b1ffc6d04b3)" (RequestToken: 1565436d-5bbc-7be1-7081-7a0631cf5842, HandlerErrorCode: InvalidRequest)

After I successfully upgraded the control plane to 1.31, I cannot upgrade the managed node group to 1.31.

roman5595 avatar Oct 30 '24 21:10 roman5595

> Requested release version 1.31.0-20241024 is not valid for kubernetes version 1.30

Are you sure your control plane is already updated?
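You can double-check the current control plane version with something like this (the cluster name is a placeholder):

$ eksctl get cluster --name <your-cluster> --output json | jq -r '.[].Version'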


And I found the reason why I cannot upgrade my managed nodegroup.

> I am also facing this issue. When I checked CloudFormation, it showed: Resource handler returned message: "Volume of size 10GB is smaller than snapshot 'snap-0145xxxxxx10a66e4', expect size >= 20GB"
>
> But I can do that (less than 20 GB) in another account. They are both in the same region. The only difference I can think of is the Kubernetes cluster version: I can create a node with a 10 GB volume in 1.29, but not in 1.30.
>
> The eksctl version I am using is 0.184.

I am using two different AMIs and OSes: AmazonLinux2 is able to reduce the disk size to 8 GB, but AmazonLinux2023 cannot. That is what I currently know... I am not sure if it is documented.
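If you need to stay on AL2023, one workaround is to make the nodegroup volume at least as large as the AMI snapshot (the error above asked for >= 20 GB). A rough sketch based on the manifest at the top of this issue; the amiFamily line is only there to make the assumption explicit:

managedNodeGroups:
  - name: mng-medium
    instanceType: t3a.medium
    amiFamily: AmazonLinux2023  # AL2023-based nodes hit the snapshot size check
    volumeSize: 20              # was 10 in the original manifest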

TreeKat71 avatar Nov 28 '24 07:11 TreeKat71

I also have the exact same problem when trying to upgrade the managed node group.

Resource handler returned message: "Requested release version 1.31.3-20250103 is not valid for kubernetes version 1.30. (Service: Eks, Status Code: 400, Request ID: 4aa1ba6a-840a-40ec-9934-0809b7c92538)" (RequestToken: a328d53f-60e2-b3a1-ba8f-5e6daa9d59ef, HandlerErrorCode: InvalidRequest)

The control plane was already upgraded from version 1.30 to 1.31 successfully.

$ eksctl upgrade cluster --approve --name eks-analytics
2025-01-08 15:25:55 [ℹ]  will upgrade cluster "eks-analytics" control plane from current version "1.30" to "1.31"
2025-01-08 15:35:50 [✔]  cluster "eks-analytics" control plane has been upgraded to version "1.31"
2025-01-08 15:35:50 [ℹ]  you will need to follow the upgrade procedure for all of nodegroups and add-ons
2025-01-08 15:35:51 [ℹ]  re-building cluster stack "eksctl-eks-analytics-cluster"
2025-01-08 15:35:51 [✔]  all resources in cluster stack "eksctl-eks-analytics-cluster" are up-to-date
2025-01-08 15:35:52 [ℹ]  checking security group configuration for all nodegroups
2025-01-08 15:35:52 [ℹ]  all nodegroups have up-to-date cloudformation templates

It shows the new version in the AWS console as well as via the command:

$ eksctl get cluster --name eks-analytics --output json | jq -r '.[].Version'
1.31

I also upgraded all the pod identity associations successfully.

2025-01-08 15:35:55 [ℹ]  2 parallel tasks: { update pod identity association kube-system/aws-load-balancer-controller, update pod identity association cert-manager/cert-manager }
2025-01-08 15:35:56 [ℹ]  updating IAM resources stack "eksctl-eks-analytics-podidentityrole-cert-manager-cert-manager" for pod identity association "cert-manager/cert-manager"
2025-01-08 15:35:56 [ℹ]  updating IAM resources stack "eksctl-eks-analytics-podidentityrole-kube-system-aws-load-balancer-controller" for pod identity association "kube-system/aws-load-balancer-controller"
2025-01-08 15:35:56 [ℹ]  waiting for CloudFormation changeset "eksctl-kube-system-aws-load-balancer-controller-update-1736321756" for stack "eksctl-eks-analytics-podidentityrole-kube-system-aws-load-balancer-controller"
2025-01-08 15:35:56 [ℹ]  nothing to update
2025-01-08 15:35:56 [ℹ]  IAM resources for kube-system/aws-load-balancer-controller (pod identity association ID: kube-system/aws-load-balancer-controller) are already up-to-date
2025-01-08 15:35:56 [ℹ]  waiting for CloudFormation changeset "eksctl-cert-manager-cert-manager-update-1736321756" for stack "eksctl-eks-analytics-podidentityrole-cert-manager-cert-manager"
2025-01-08 15:35:56 [ℹ]  nothing to update
2025-01-08 15:35:56 [ℹ]  IAM resources for cert-manager/cert-manager (pod identity association ID: cert-manager/cert-manager) are already up-to-date
2025-01-08 15:35:56 [ℹ]  all tasks were completed successfully

And the addons:

2025-01-08 15:35:59 [ℹ]  Kubernetes version "1.31" in use by cluster "eks-analytics"
2025-01-08 15:35:59 [ℹ]  updating addon
2025-01-08 15:38:02 [ℹ]  addon "aws-ebs-csi-driver" active
2025-01-08 15:38:02 [ℹ]  updating addon
2025-01-08 15:38:13 [ℹ]  addon "coredns" active
2025-01-08 15:38:13 [ℹ]  updating addon
2025-01-08 15:38:24 [ℹ]  addon "eks-pod-identity-agent" active
2025-01-08 15:38:24 [ℹ]  new version provided v1.31.3-eksbuild.2
2025-01-08 15:38:24 [ℹ]  updating addon
2025-01-08 15:39:07 [ℹ]  addon "kube-proxy" active
2025-01-08 15:39:08 [ℹ]  updating addon
2025-01-08 15:39:18 [ℹ]  addon "vpc-cni" active

At first I just tried the following to upgrade the node group, and it finished without error but left the node group at version 1.30:

$ eksctl upgrade nodegroup --cluster eks-analytics --name eks-analytics-ng-1 --wait
2025-01-08 15:40:00 [ℹ]  setting ForceUpdateEnabled value to false
2025-01-08 15:40:00 [ℹ]  updating nodegroup stack
2025-01-08 15:40:01 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1736322000" for stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:40:31 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1736322000" for stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:40:31 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:41:01 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:41:02 [ℹ]  nodegroup "eks-analytics-ng-1" is already up-to-date
2025-01-08 15:41:02 [ℹ]  will upgrade nodes to Kubernetes version: 1.30
2025-01-08 15:41:02 [ℹ]  upgrading nodegroup version
2025-01-08 15:41:02 [ℹ]  updating nodegroup stack
2025-01-08 15:41:02 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1736322062" for stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:41:32 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1736322062" for stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:41:33 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:42:03 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:42:03 [ℹ]  nodegroup successfully upgraded

But I noticed it left things at version 1.30. So I then tried the following, which resulted in the error above within CloudFormation:

$ eksctl upgrade nodegroup --cluster eks-analytics --kubernetes-version 1.31 --name eks-analytics-ng-1 --wait
2025-01-08 15:57:09 [ℹ]  will upgrade nodes to release version: 1.31.3-20250103
2025-01-08 15:57:09 [ℹ]  upgrading nodegroup version
2025-01-08 15:57:09 [ℹ]  updating nodegroup stack
2025-01-08 15:57:09 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1736323029" for stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:57:39 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1736323029" for stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:57:40 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:58:10 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:59:03 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 15:59:52 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 16:01:02 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 16:01:52 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
2025-01-08 16:03:18 [ℹ]  waiting for CloudFormation stack "eksctl-eks-analytics-nodegroup-eks-analytics-ng-1"
Error: error updating nodegroup stack: waiter state transitioned to Failure

jim-barber-he avatar Jan 08 '25 08:01 jim-barber-he

Got the same issue when upgrading node groups from 1.29 to 1.30. The control plane was updated successfully.

$ eksctl version
0.201.0

dmzeus avatar Jan 31 '25 12:01 dmzeus

We had the same problem on some clusters, and we noticed that the CloudFormation template generated for upgrading the nodegroup put the wrong "Version" number in the "AWS::EKS::Nodegroup" resource. Here we upgraded from 1.30 to 1.31, but the version is still 1.30 in CFN. We got the message "Requested release version 1.31.4-20250123 is not valid for kubernetes version 1.30."

 "ManagedNodeGroup": {
      "Type": "AWS::EKS::Nodegroup",
      "Properties": {
        "AmiType": "AL2_x86_64",
        "ClusterName": "testcluster",
        "ForceUpdateEnabled": true,
        "InstanceTypes": [
          "c5a.large",
          "c5.large",
          "c6i.large"
        ],
        "Labels": {
          "alpha.eksctl.io/cluster-name": "testcluster",
          "alpha.eksctl.io/nodegroup-name": "ng-eks-1",
        },
        "LaunchTemplate": {
          "Id": {
            "Ref": "LaunchTemplate"
          }
        },
        "NodeRole": {
          "Fn::GetAtt": [
            "NodeInstanceRole",
            "Arn"
          ]
        },
        "NodegroupName": "ng-eks-1",
        "ReleaseVersion": "1.31.4-20250123",
        "ScalingConfig": {
          "DesiredSize": 2,
          "MaxSize": 4,
          "MinSize": 2
        },
        "Subnets": [
          "subnet-0800de77f4fd29000",
          "subnet-0500ceeab196d00c"
        ],
        "Tags": {
          "alpha.eksctl.io/nodegroup-name": "ng-eks-1",
          "alpha.eksctl.io/nodegroup-type": "managed",
          "k8s.io/cluster-autoscaler/testcluster": "owned",
          "k8s.io/cluster-autoscaler/enabled": "true",
        },
        "Taints": [
          {
            "Effect": "NO_EXECUTE",
            "Key": "node.cilium.io/agent-not-ready",
            "Value": "true"
          }
        ],
        "UpdateConfig": {
          "MaxUnavailable": 2
        },
        "Version": "1.30"
      }
    }, 

As a workaround, we manually changed the CloudFormation template to replace version 1.30 with 1.31 and updated the CFN stack, which made it work. I haven't yet managed to find out where this version comes from. I hope this will help unblock those who are in this situation.
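For reference, one way to apply this kind of manual change from the AWS CLI (a sketch; the stack name follows the eksctl-<cluster>-nodegroup-<name> pattern visible in the logs above):

$ aws cloudformation get-template --stack-name eksctl-testcluster-nodegroup-ng-eks-1 \
    --query TemplateBody --output json > template.json
# edit template.json: set Resources.ManagedNodeGroup.Properties.Version to "1.31"
$ aws cloudformation update-stack --stack-name eksctl-testcluster-nodegroup-ng-eks-1 \
    --template-body file://template.json \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM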

bartleboeuf avatar Feb 03 '25 14:02 bartleboeuf

It works properly if you create and upgrade your cluster from a config file:

eksctl create cluster -f <cluster_config>.yaml
eksctl upgrade cluster -f <cluster_config>.yaml

And it doesn't work if you create the cluster from a config file but upgrade using the --name arg :)
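That is, bump metadata.version in the config file to the target version before running the upgrade. A sketch based on the manifest at the top of this issue:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: yby-test
  region: eu-central-1
  version: "1.29"  # target version, bumped from "1.28"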

dmzeus avatar Feb 03 '25 14:02 dmzeus

+1 on running into this problem now.

eksctl version
0.204.0

eksctl upgrade cluster -f eks.yaml --approve

2025-02-14 12:37:25 [!]  NOTE: cluster VPC (subnets, routing & NAT Gateway) configuration changes are not yet implemented
2025-02-14 12:37:26 [ℹ]  will upgrade cluster "eks" control plane from current version "1.31" to "1.32"
2025-02-14 12:45:22 [✔]  cluster "eks" control plane has been upgraded to version "1.32"
2025-02-14 12:45:22 [ℹ]  you will need to follow the upgrade procedure for all of nodegroups and add-ons
2025-02-14 12:45:22 [ℹ]  re-building cluster stack "eksctl-eks-cluster"
2025-02-14 12:45:22 [✔]  all resources in cluster stack "eksctl-eks-cluster" are up-to-date
2025-02-14 12:45:22 [ℹ]  checking security group configuration for all nodegroups
2025-02-14 12:45:22 [ℹ]  all nodegroups have up-to-date cloudformation templates

eksctl upgrade nodegroup --name=ng-1-workers --cluster=eks --kubernetes-version=1.32

Requested release version 1.32.0-20250203 is not valid for kubernetes version 1.31. 

I can also confirm that manually changing the version in the CloudFormation template got me past this issue.

twarkie avatar Feb 14 '25 12:02 twarkie

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 17 '25 02:03 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Mar 23 '25 02:03 github-actions[bot]

Had the same issue upgrading a handful of environments that were multiple versions behind the latest version in EKS. I found that updating the "Version" field in the CloudFormation template as described above allowed me to upgrade all of my node groups to the next version to align with the control plane (1.30 -> 1.31). However, eksctl still failed on the next pass when upgrading to the next release.

I had one environment that upgraded across all versions without issue and compared the CloudFormation files for that environment to the files from the failing environments. What I found was that the "good" environment had no "Version" field at all in its CloudFormation file. So, for my last set of environment upgrades, I first upgraded the ManagedNodeGroup resource by updating the ReleaseVersion to a version that aligned with the control plane and removing the Version field entirely. After those updates applied successfully, I was able to use eksctl to perform subsequent upgrades.

TL;DR: Remove the "Version" field from the ManagedNodeGroup resource in the CloudFormation files for the first round of upgrades; eksctl will likely work on later upgrades.
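For example, with jq on the downloaded template (a sketch; the ReleaseVersion value is the one from the template posted earlier in this thread):

$ jq 'del(.Resources.ManagedNodeGroup.Properties.Version)
    | .Resources.ManagedNodeGroup.Properties.ReleaseVersion = "1.31.4-20250123"' \
    template.json > template-fixed.json
# then update the nodegroup stack with template-fixed.json as described in the workaround above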

twehner avatar Apr 18 '25 02:04 twehner