terraform-aws-eks

reconciliation of cluster_version and ami_release_version during node-group updates

Open AndreiBanaruTakeda opened this issue 1 year ago • 11 comments

Description

This issue is mainly related to the submodule eks-managed-node-group.

We use ami_type = "BOTTLEROCKET_x86_64" coupled with cluster_version and ami_release_version variables.

The ami_release_version is configured for us in a TFE Variable Set applied to our TFE workspaces, so we can control the version across all workspaces at once. cluster_version comes from a data call to the EKS cluster, so it reflects the version the cluster is actually running.
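
A minimal sketch of what such a data call might look like. The label `this`, the local name, and the cluster name "my-cluster" are assumptions for illustration, not taken from our actual configuration:

```hcl
# Hypothetical sketch: read the running Kubernetes version of an existing cluster.
# "this" and "my-cluster" are illustrative names only.
data "aws_eks_cluster" "this" {
  name = "my-cluster"
}

locals {
  # For example "1.28"; this is the value handed to the node-group
  # module as cluster_version.
  cluster_version = data.aws_eks_cluster.this.version
}
```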

Let's consider the initial values:

ami_release_version = 1.20.5-a3e8bda1
cluster_version = 1.28

If the control plane is upgraded to 1.29 and I run a new plan and apply for the node-group configuration, the node groups are updated to cluster_version = 1.29, but ami_release_version becomes 1.21.1-82691b51 (the latest, as of today) instead of staying at the pinned value.

I have to run a new plan and apply to bring the nodes back to the target ami_release_version:

ami_release_version = 1.20.5-a3e8bda1
cluster_version = 1.29

  • [x] ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: 20.24.0
  • Terraform version: 1.7.5
  • Provider version(s): 5.65.0

Reproduction Code [Required]

provider "aws" {
  region  = "us-east-1"
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.24.0"

  cluster_name    = "my-cluster"
  cluster_version = var.cluster_version


  cluster_endpoint_private_access              = true
  cluster_endpoint_public_access               = false
  create_cloudwatch_log_group                  = false
  create_cluster_security_group                = true
  create_iam_role                              = true
  create_node_security_group                   = true
  enable_irsa                                  = true
  node_security_group_enable_recommended_rules = true

  eks_managed_node_group_defaults = {
    vpc_security_group_ids = []
  }

  subnet_ids = var.subnet_ids
  vpc_id     = var.vpc_id
}

module "eks_managed_node_groups" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  version = "20.24.0"

  cluster_name    = module.eks.cluster_name
  name            = join("", [module.eks.cluster_name, "-S-NG-001"])
  use_name_prefix = false

  vpc_security_group_ids = [module.eks.node_security_group_id]

  create_iam_role            = true
  iam_role_attach_cni_policy = true

  subnet_ids = var.subnet_ids

  min_size     = 2
  max_size     = 2
  desired_size = 2

  create_launch_template          = true
  launch_template_name            = join("", [module.eks.cluster_name, "-S-NG-001"])
  launch_template_use_name_prefix = false

  ami_type             = "BOTTLEROCKET_x86_64"
  ami_release_version  = data.aws_ssm_parameter.image_version[0].value
  cluster_version      = var.cluster_version
  cluster_auth_base64  = module.eks.cluster_certificate_authority_data
  cluster_endpoint     = module.eks.cluster_endpoint
  cluster_service_cidr = module.eks.cluster_service_cidr

  capacity_type  = "SPOT"
  instance_types = ["m5.xlarge"]
}

data "aws_ssm_parameter" "image_version" {
  count = var.ami_release_version != null ? 1 : 0
  name  = "/aws/service/bottlerocket/aws-k8s-${module.eks.cluster_version}/x86_64/${var.ami_release_version}/image_version"
}

variable "ami_release_version" {
  type    = string
  default = "1.20.5"
}

variable "subnet_ids" {
  type    = list(string)
}

variable "vpc_id" {
  type    = string
}

variable "cluster_version" {
  type    = string
  default = "1.28"
}

Steps to reproduce the behavior:

  1. Use the above HCL to build the resources; set vpc_id and subnet_ids according to your environment.
  2. After the resources are built, update the cluster_version variable to 1.29 and apply.
  3. The control plane is upgraded from 1.28 to 1.29.
  4. The node group is updated to use a 1.29 AMI, but with a release_version of 1.21.1-82691b51 instead of 1.20.5-a3e8bda1.

Expected behavior

When both the cluster_version and ami_release_version variables change, they should be reconciled in a single plan and apply.

Actual behavior

Two plan-and-apply cycles are required to bring the nodes to a specific cluster_version and ami_release_version.

The first plan brings cluster_version to the target version and ami_release_version to the latest available version.

The second plan downgrades ami_release_version to the desired value.

Terminal Output Screenshot(s)

Update history tab: image

AndreiBanaruTakeda, Sep 03 '24 13:09

Unfortunately, without a reproduction we can only speculate.

bryantbiggs, Sep 03 '24 14:09

I've updated the issue to include the IaC for reproduction

AndreiBanaruTakeda, Sep 03 '24 16:09

Running:

aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name my-cluster-S-NG-001 --kubernetes-version "1.30" --release-version "1.20.5-a3e8bda1"

upgrades the node group as expected; the release version is not bumped to 1.21.1-82691b51.

AndreiBanaruTakeda, Sep 05 '24 18:09

Why are you doing this:

  ami_release_version  = data.aws_ssm_parameter.image_version[0].value
  ...
}

data "aws_ssm_parameter" "image_version" {
  count = var.ami_release_version != null ? 1 : 0
  name  = "/aws/service/bottlerocket/aws-k8s-${module.eks.cluster_version}/x86_64/${var.ami_release_version}/image_version"
}

instead of this:

  ami_release_version  = var.ami_release_version
  ...
}

bryantbiggs, Sep 05 '24 19:09

Personal preference.

I like to keep it simple: 1.20.5 instead of 1.20.5-a3e8bda1.

I'm open to changing it if that is what causes the issue.

AndreiBanaruTakeda, Sep 05 '24 20:09

I don't follow - you are inputting the value of 1.20.5-a3e8bda1 via the ami_release_version variable, only to look it up from the SSM parameter and get the exact same value back. If you already know the release version, just pass it to the input as a string.

bryantbiggs, Sep 05 '24 21:09

I am inputting the value of 1.20.5 via the ami_release_version variable, and then the SSM parameter resolves it to the extended format, which I then use in the eks-managed-node-group module.

aws ssm get-parameter --name "/aws/service/bottlerocket/aws-k8s-1.30/x86_64/1.20.5/image_version" --region us-east-1 --query "Parameter.Value" --output text

AndreiBanaruTakeda, Sep 06 '24 07:09

There are two paths published in SSM to retrieve the image_version:

/aws/service/bottlerocket/aws-k8s-1.30/x86_64/1.20.5/image_version
/aws/service/bottlerocket/aws-k8s-1.30/x86_64/1.20.5-a3e8bda1/image_version
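
A minimal sketch of resolving the short form through the first path in Terraform. The variable names mirror the reproduction code; the output name `resolved_ami_release_version` is hypothetical, and cluster_version is read from a plain variable here only to keep the sketch self-contained:

```hcl
variable "cluster_version" {
  type    = string
  default = "1.30"
}

variable "ami_release_version" {
  type    = string
  default = "1.20.5" # short form, without the commit suffix
}

# Look up the full release version published under the short-form path.
data "aws_ssm_parameter" "image_version" {
  name = "/aws/service/bottlerocket/aws-k8s-${var.cluster_version}/x86_64/${var.ami_release_version}/image_version"
}

# Hypothetical output, added only to inspect what would be passed on to
# the eks-managed-node-group module. nonsensitive() is used because the
# provider marks SSM parameter values as sensitive.
output "resolved_ami_release_version" {
  value = nonsensitive(data.aws_ssm_parameter.image_version.value)
}
```

With these inputs, the output should show the extended form (1.20.5-a3e8bda1 for this release, going by the values earlier in this issue).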

AndreiBanaruTakeda, Sep 06 '24 07:09

That's not what the reproduction details you provided above show:

image

bryantbiggs, Sep 06 '24 13:09

I wasn't sufficiently clear. Sorry about that. The values you've just pointed out are the ones supplied to the eks-managed-node-group child module: the resolved ones, so to speak.

The reproduction code, which I added as an edit to the opened issue, shows that I'm passing the short form of the version:

variable "ami_release_version" {
  type    = string
  default = "1.20.5"
}

AndreiBanaruTakeda, Sep 06 '24 14:09

Can this be acknowledged as a bug?

AndreiBanaruTakeda, Oct 01 '24 07:10

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

github-actions[bot], Nov 01 '24 00:11

This issue was automatically closed because of stale in 10 days

github-actions[bot], Nov 12 '24 00:11

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot], Dec 12 '24 02:12