terraform-aws-eks

Infinite Plan Update on eks_managed_node_group for launch_template version -> $Default

Open · sinkr opened this issue 1 year ago · 2 comments

Description

Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration (see the examples/* directory for references that you can copy+paste and tailor to match your configs if you are unable to copy your exact configuration). The reproduction MUST be executable by running terraform init && terraform apply without any further changes.

If your request is for a new feature, please use the Feature request template.

  • [x] ✋ I have searched the open/closed issues and my issue is not listed.

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which hopefully is the best practice you are following!): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

  • Module version [Required]: v20.24.0

  • Terraform version: Terraform v1.9.5 on darwin_arm64

  • Provider version(s):

  • provider registry.terraform.io/alekc/kubectl v2.0.4
  • provider registry.terraform.io/gavinbunney/kubectl v1.14.0
  • provider registry.terraform.io/hashicorp/aws v5.64.0
  • provider registry.terraform.io/hashicorp/cloudinit v2.3.4
  • provider registry.terraform.io/hashicorp/helm v2.14.1
  • provider registry.terraform.io/hashicorp/kubernetes v2.31.0
  • provider registry.terraform.io/hashicorp/null v3.2.2
  • provider registry.terraform.io/hashicorp/time v0.12.0
  • provider registry.terraform.io/hashicorp/tls v4.0.5
  • provider registry.terraform.io/terraform-aws-modules/http v2.4.1

Reproduction Code [Required]

node-groups.tf:

module "general_worker_nodes" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  version = "v20.24.0"

  cluster_name                           = var.eks_cluster_name
  cluster_primary_security_group_id      = module.eks.cluster_primary_security_group_id
  cluster_version                        = var.eks_cluster_version
  cluster_service_cidr                   = module.eks.cluster_service_cidr
  create_iam_role                        = false
  create_launch_template                 = false
  iam_role_arn                           = aws_iam_role.general_worker_nodes.arn
  launch_template_id                     = aws_launch_template.general_worker_nodes.id
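  # NOTE: launch_template_version is not set here; with an externally managed
  # launch template the module appears to fall back to "$Default" (see the
  # perpetual diff under "Actual behavior" below)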
  name                                   = local.short_node_group_name_prefix
  subnet_ids                             = data.terraform_remote_state.vpc.outputs.private_subnets
  use_custom_launch_template             = true
  update_launch_template_default_version = false
  vpc_security_group_ids                 = [data.terraform_remote_state.vpc.outputs.internal_subnet_id]

  max_size     = var.eks_nodegroups["general"].max_size
  min_size     = var.eks_nodegroups["general"].min_size
  desired_size = var.eks_nodegroups["general"].desired_size

  instance_types = var.eks_nodegroups["general"].instance_types
  ami_type       = var.eks_nodegroups["general"].ami_type
  capacity_type  = var.eks_nodegroups["general"].capacity_type

  labels = {
    "nodegroup"   = "general",
    "environment" = data.terraform_remote_state.vpc.outputs.vpc_name_short
  }

  pre_bootstrap_user_data = <<-EOT
#!/bin/bash
mkdir -m 0600 -p ~/.ssh
touch ~ec2-user/.ssh/authorized_keys
cat >> ~ec2-user/.ssh/authorized_keys <<EOF
${data.terraform_remote_state.vpc.outputs.vpc_ssh_key}
EOF
  EOT

  tags = {
    "Name"                                          = "${var.eks_cluster_name}-Gen-EKS-Worker-Nodes"
    "efs.csi.aws.com/cluster"                       = "true"
    "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
    "aws-node-termination-handler/managed"          = "true"
  }
}

launch-templates.tf:

resource "aws_launch_template" "general_worker_nodes" {
  update_default_version = true
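  # With update_default_version = true, every new revision of this template
  # automatically becomes its $Default version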
  key_name               = var.eks_nodegroups["general"].ssh_key_name
  vpc_security_group_ids = [data.terraform_remote_state.vpc.outputs.internal_subnet_id]

  ebs_optimized = true

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size = var.eks_nodegroups["general"].disk_size
      encrypted   = true
    }
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
  }

  monitoring {
    enabled = true
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      "Name"                                          = "${var.eks_cluster_name}-General-EKS-Worker-Nodes"
      "efs.csi.aws.com/cluster"                       = "true"
      "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
      "aws-node-termination-handler/managed"          = "true"
    }
  }

  tag_specifications {
    resource_type = "volume"
    tags = {
      "Name"                                          = "${var.eks_cluster_name}-General-EKS-Worker-Nodes"
      "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
    }
  }

  tag_specifications {
    resource_type = "network-interface"
    tags = {
      "Name"                                          = "${var.eks_cluster_name}-General-EKS-Worker-Nodes"
      "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
    }
  }
}

auto.tfvars:

aws_region          = "us-east-2"
eks_cluster_name    = "development"
eks_cluster_version = "1.30"
eks_nodegroups = {
  general = {
    instance_types = [
      "c6a.xlarge",
      "m6a.xlarge",
      "m6a.2xlarge",
      "c5.xlarge",
      "m5.xlarge",
      "c4.xlarge",
      "m4.xlarge"
    ]
    ami_type                   = "AL2_x86_64"
    capacity_type              = "SPOT"
    desired_size               = 8
    disk_size                  = 128
    enabled                    = true
    max_size                   = 36
    max_unavailable_percentage = 25
    min_size                   = 4
    nodeselector               = "general"
    ssh_key_name               = "MyCompany Staging"
  }
}

Steps to reproduce the behavior:

Workspaces: Yes.

Cleared cache: Yes.

  1. terraform workspace select development-us-east-2-<redacted>
  2. terraform init -upgrade
  3. terraform apply

Expected behavior

Once applied, subsequent plans should not attempt to update the launch_template version from its current value to $Default.

Actual behavior

The plan continues to want to update the launch template's version for every run:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.findigs-eks.module.general_worker_nodes.aws_eks_node_group.this[0] will be updated in-place
  ~ resource "aws_eks_node_group" "this" {
        id                     = "development:development-Gen-EKS-Worker-Nodes-20240807005426470100000001"
        tags                   = {
            "Name"                                 = "development-Gen-EKS-Worker-Nodes"
            "aws-node-termination-handler/managed" = "true"
            "efs.csi.aws.com/cluster"              = "true"
            "kubernetes.io/cluster/development"    = "owned"
        }
        # (16 unchanged attributes hidden)

      ~ launch_template {
            id      = "lt-0f09c225dd95a124d"
            name    = "terraform-20220519232718013800000003"
          ~ version = "11" -> "$Default"
        }

        # (3 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Terminal Output Screenshot(s)

(screenshot not reproduced here)

sinkr · Aug 28 '24 13:08

That is a lot of interesting configuration - may I ask why you are approaching it from this perspective? Meaning:

  1. Why use the node group sub-module independent of the overall EKS module?
  2. Why use a custom launch template outside of the module when the module already supports a custom launch template that is "safer" for EKS?
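
For illustration (not necessarily the maintainer's exact recommendation), a rough sketch of what that could look like: dropping the external aws_launch_template and letting the sub-module create the launch template from its documented block_device_mappings, metadata_options, and tag_specifications inputs. Input names should be verified against v20.24.0.

module "general_worker_nodes" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  version = "v20.24.0"

  # ... cluster_name, cluster_version, subnet_ids, sizes, etc. as in the
  # reproduction above ...

  # Let the module create and version the launch template itself
  create_launch_template     = true
  use_custom_launch_template = true

  key_name      = var.eks_nodegroups["general"].ssh_key_name
  ebs_optimized = true

  block_device_mappings = {
    xvda = {
      device_name = "/dev/xvda"
      ebs = {
        volume_size = var.eks_nodegroups["general"].disk_size
        encrypted   = true
      }
    }
  }

  metadata_options = {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
  }

  # Tag the instances, volumes, and ENIs launched from the template
  tag_specifications = ["instance", "volume", "network-interface"]
  launch_template_tags = {
    "Name"                                          = "${var.eks_cluster_name}-General-EKS-Worker-Nodes"
    "efs.csi.aws.com/cluster"                       = "true"
    "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
  }
}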

bryantbiggs · Aug 28 '24 22:08

Hi Bryant, thank you for the response.

I can't definitively say why I ended up with this combination, but loosely I think it had to do with not getting the correct disk_size and wanting tagging applied to all attached entities (ENIs, EBS volumes, etc.).

IIRC, this was the only way I was able to get that combination working, though perhaps some other iteration would work; long-term, I think I'm moving towards Karpenter anyway.

After many hours of debugging, I found that if I explicitly set launch_template_version to the current integer value, the infinite plan goes away. That said, I feel there's an opportunity here to add logic so that the module does not fall back to $Default unnecessarily.

My initial hypothesis was that the latest (or specified) launch template revision wasn't tagged as the Default, but it was.
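
For reference, a minimal sketch of that workaround, assuming launch_template_version accepts the aws_launch_template resource's version attributes rather than a hard-coded integer:

  # In the module "general_worker_nodes" block from the reproduction:
  launch_template_id      = aws_launch_template.general_worker_nodes.id
  launch_template_version = aws_launch_template.general_worker_nodes.default_version
  # or, to always track the newest revision:
  # launch_template_version = aws_launch_template.general_worker_nodes.latest_version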

sinkr · Aug 29 '24 17:08

This issue has been automatically marked as stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.

github-actions[bot] · Sep 29 '24 00:09

This issue was automatically closed because it has been stale for 10 days.

github-actions[bot] · Oct 10 '24 00:10

Just what the doctor ordered...not!

sinkr · Oct 10 '24 00:10

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Nov 09 '24 02:11 github-actions[bot]