
Using own networking; nodes are unable to join cluster

utahcon opened this issue 1 year ago • 2 comments

Description

I am setting up two clusters (eks, eks-home): one uses the VPC created by the included VPC module, the other uses a VPC and subnets referenced from existing resources.

  • [X] ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: 20.2.1

  • Terraform version: 1.7.3

  • Provider version(s):

    • provider registry.terraform.io/hashicorp/aws v5.35.0
    • provider registry.terraform.io/hashicorp/cloudinit v2.3.3
    • provider registry.terraform.io/hashicorp/random v3.5.1
    • provider registry.terraform.io/hashicorp/time v0.10.0
    • provider registry.terraform.io/hashicorp/tls v4.0.5

Reproduction Code [Required]

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.1"

  name = "education-vpc"

  cidr = "10.0.0.0/16"
  azs  = slice(data.aws_availability_zones.available.names, 0, 3)

  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                      = 1
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"             = 1
  }
}

module "eks" {
  source                         = "terraform-aws-modules/eks/aws"
  version                        = "20.2.1"
  cluster_name                   = local.cluster_name
  cluster_version                = "1.29"
  vpc_id                         = module.vpc.vpc_id
  subnet_ids                     = module.vpc.private_subnets
  cluster_endpoint_public_access = true

  eks_managed_node_group_defaults = {
    ami_type = "AL2_x86_64"
  }

  eks_managed_node_groups = {
    one = {
      name           = "node-group-1"
      instance_types = ["t3.small"]
      min_size       = 1
      max_size       = 3
      desired_size   = 2
    }

    two = {
      name           = "node-group-2"
      instance_types = ["t3.small"]
      min_size       = 1
      max_size       = 2
      desired_size   = 1
    }
  }
}

module "eks-home" {
  source                         = "terraform-aws-modules/eks/aws"
  version                        = "20.2.1"
  cluster_name                   = "${local.cluster_name}-home"
  cluster_version                = "1.29"
  vpc_id                         = data.aws_vpc.core.id
  subnet_ids                     = ["subnet-04f23eb8f54a20e62", "subnet-0b13245e8c4df7d08", "subnet-00d031043f0b62c5c"]
  cluster_endpoint_public_access = true

  eks_managed_node_group_defaults = {
    ami_type = "AL2_x86_64"
  }

  eks_managed_node_groups = {
    one = {
      name           = "node-group-1-home"
      instance_types = ["t3.small"]
      min_size       = 1
      max_size       = 3
      desired_size   = 2
    }

    two = {
      name           = "node-group-2-home"
      instance_types = ["t3.small"]
      min_size       = 1
      max_size       = 2
      desired_size   = 1
    }
  }
}
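
The snippet above also references data.aws_availability_zones.available, data.aws_vpc.core, and local.cluster_name, which are not shown. A minimal sketch of those supporting definitions (the names, tags, and values here are assumptions for illustration, not the exact originals):

data "aws_availability_zones" "available" {
  # only use AZs that do not require opt-in
  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}

# illustrative lookup of the pre-existing VPC; the real configuration may
# select it by ID or by different tags
data "aws_vpc" "core" {
  tags = {
    Name = "core"
  }
}

locals {
  cluster_name = "education-eks" # assumed; the error output below shows "education-eks-home"
}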

Steps to reproduce the behavior:

When the above HCL runs, it builds two clusters: one with a fresh VPC/subnets/SGs/NACLs/etc., the other in my existing VPC.

The eks-home cluster nodes are unable to join the cluster. I can't figure out what I am missing and am looking for help.

Expected behavior

I expect both clusters to come up successfully; the only difference is that one uses an existing VPC.

Actual behavior

The eks cluster with the fresh VPC/subnets works; the eks-home cluster using the existing VPC does not.

Terminal Output Screenshot(s)

│ Error: waiting for EKS Node Group (education-eks-home:node-group-1-home-20240213193545975900000001) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-064d226aa7e41f356, i-0a23e706eb655fd18: NodeCreationFailure: Instances failed to join the kubernetes cluster
│
│   with module.eks-home.module.eks_managed_node_group["one"].aws_eks_node_group.this[0],
│   on .terraform/modules/eks-home/modules/eks-managed-node-group/main.tf line 308, in resource "aws_eks_node_group" "this":
│  308: resource "aws_eks_node_group" "this" {
│

Additional context

EKS VPC Details

    Network ACL:
        acl-09499c0beba579e3a
            Ingress:
                100 ALL:ALL 0/0     Allow
                101 ALL:ALL ::/0    Allow
            Egress:
                100 ALL:ALL 0/0     Allow
                101 ALL:ALL ::/0    Allow
    NAT Gateway:
        nat-0334dc8ee1978a397   Public
    Route Table:
        rtb-0ec0db41ee880c262
            Routes:
                10.0.0.0/16 local
                0.0.0.0/0   nat-0334dc8ee1978a397
            Subnets:
                subnet-066405cc0c25e6179
                subnet-030f7f95759f01724
                subnet-052c037ab22f00738
    Subnets
        subnet-066405cc0c25e6179
            rtb-0ec0db41ee880c262
            acl-09499c0beba579e3a
        subnet-030f7f95759f01724
            rtb-0ec0db41ee880c262
            acl-09499c0beba579e3a
        subnet-052c037ab22f00738
            rtb-0ec0db41ee880c262
            acl-09499c0beba579e3a
    Security Groups
        sg-0f914078f036f4f41
            Ingress:
                ALL:ALL		sg-0f914078f036f4f41
            Egress:
                ALL:ALL		0.0.0.0/0
        sg-0073293245dc624a0
            Ingress:
                TCP:443     sg-02db6c4108fad1284
        sg-02db6c4108fad1284
            Ingress:
                TCP:53 		sg-02db6c4108fad1284
                UDP:53 		sg-02db6c4108fad1284
                TCP:443 	sg-0073293245dc624a0
                TCP:1025-* 	sg-02db6c4108fad1284
                TCP:4443 	sg-0073293245dc624a0
                TCP:6443 	sg-0073293245dc624a0
                TCP:8443 	sg-0073293245dc624a0
                TCP:9443 	sg-0073293245dc624a0
                TCP:10250	sg-0073293245dc624a0
            Egress:
                ALL:ALL     0.0.0.0/0

EKS Home VPC Details

    Network ACL:
        acl-0d9f7b0672432b622
            Ingress:
                100 ALL:ALL 0/0     Allow
                101 ALL:ALL ::/0    Allow
            Egress:
                100 ALL:ALL 0/0     Allow
                101 ALL:ALL ::/0    Allow
    NAT Gateways:
        nat-041d97a1b56381127   Public
    Route Table:
        rtb-0c3e8df22f7257041
            Routes:
                10.0.0.0/16 local
                0.0.0.0/0   nat-041d97a1b56381127
            Subnets:
                subnet-04f23eb8f54a20e62
                subnet-00d031043f0b62c5c
                subnet-0b13245e8c4df7d08
    Subnets
        subnet-04f23eb8f54a20e62
            rtb-0c3e8df22f7257041
            acl-0d9f7b0672432b622
        subnet-0b13245e8c4df7d08
            rtb-0c3e8df22f7257041
            acl-0d9f7b0672432b622
        subnet-00d031043f0b62c5c
            rtb-0c3e8df22f7257041
            acl-0d9f7b0672432b622
    Security Groups
        sg-057bf55b6cbf41ecd
            Ingress:
                ALL:ALL     sg-057bf55b6cbf41ecd
            Egress:
                ALL:ALL     0.0.0.0/0
        sg-098ae5c5728e4b68b
            Ingress:
                TCP:443     sg-02e487176f2b42374
        sg-02e487176f2b42374
            Ingress:
                TCP:53		sg-02e487176f2b42374
                UDP:53 		sg-02e487176f2b42374
                TCP:443 	sg-098ae5c5728e4b68b
                TCP:1025-*	sg-02e487176f2b42374
                TCP:4443 	sg-098ae5c5728e4b68b
                TCP:6443 	sg-098ae5c5728e4b68b
                TCP:8443 	sg-098ae5c5728e4b68b
                TCP:9443 	sg-098ae5c5728e4b68b
                TCP:10250   sg-098ae5c5728e4b68b
            Egress:
                ALL:ALL		0.0.0.0/0

utahcon · Feb 13 '24

I'm having a similar issue,

│ Error: waiting for EKS Node Group (my-cluser:my-eks-managed-node-group-20240217064513134700000005) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-0738caa5a870a1536, i-0ac0190c2a8b18923: NodeCreationFailure: Instances failed to join the kubernetes cluster
│
│   with module.my-cluster.module.my-eks.module.eks_managed_node_group["eks_mng"].aws_eks_node_group.this[0],
│   on .terraform/modules/my-cluster.my-eks/modules/eks-managed-node-group/main.tf line 308, in resource "aws_eks_node_group" "this":
│  308: resource "aws_eks_node_group" "this" {

Did you find anything in CloudTrail? I'm starting to wonder if AWS changed something on their side of the API.

CloudTrail is telling me the instance profile name is invalid,

"errorMessage": "You must use a valid fully-formed launch template. Value (eks-d8c6d9f9-90bb-c537-cb25-40c5255cf213) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name",

even though it seems fine. I can dry-run an instance with the launch template...

(0)$ aws ec2 run-instances --launch-template LaunchTemplateName=eks-d8c6d9f9-90bb-c537-cb25-40c5255cf213,Version='1' --dry-run --subnet-id subnet-0e18a11f29685192e --profile pit-dev

An error occurred (DryRunOperation) when calling the RunInstances operation: Request would have succeeded, but DryRun flag is set.
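
Two read-only checks that might help narrow this down (a sketch only; the identifiers are copied verbatim from the messages above, and --profile/--region may need to be added):

# show which instance profile the launch template actually references
(0)$ aws ec2 describe-launch-template-versions \
      --launch-template-name eks-d8c6d9f9-90bb-c537-cb25-40c5255cf213 \
      --versions 1 \
      --query 'LaunchTemplateVersions[0].LaunchTemplateData.IamInstanceProfile'

# confirm an instance profile with that exact name exists in IAM
(0)$ aws iam get-instance-profile \
      --instance-profile-name eks-d8c6d9f9-90bb-c537-cb25-40c5255cf213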

I'm going to write my own issue with more details and I'll link it here.

gmisura · Feb 17 '24

I'm also getting the same error while using an existing private VPC + subnet. Any ideas on how to fix it? This is the first time I'm using this module.

│ Error: waiting for EKS Node Group (Anomalo_EKS:general-20240223232200053200000001) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-0043619eeebf2f8dd: NodeCreationFailure: Instances failed to join the kubernetes cluster
│
│   with module.eks.module.eks_managed_node_group["general"].aws_eks_node_group.this[0],
│   on .terraform/modules/eks/modules/eks-managed-node-group/main.tf line 308, in resource "aws_eks_node_group" "this":
│  308: resource "aws_eks_node_group" "this" {

edkarlin · Feb 24 '24

Info?

dantot · Mar 01 '24

Just started looking at the v20.x terraform-aws-modules/eks/aws module. I too cannot add managed node groups to an existing VPC. I SSH'd into an EC2 instance of the node group and there are errors about containerd and the CNI, along with 403s when trying to pull the EKS containers from ECR. I have the correct policies attached to the role.
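
For reference, the kind of on-node checks involved, as a rough sketch (unit names and log paths assume the AL2 EKS-optimized AMI; exact output will differ):

# after SSH-ing into one of the failed instances
sudo journalctl -u kubelet --no-pager | tail -n 50      # kubelet errors, e.g. CNI plugin not initialized
sudo journalctl -u containerd --no-pager | tail -n 50   # containerd / image pull (403) errors
sudo cat /var/log/cloud-init-output.log                 # shows whether the bootstrap user data ran at all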

v19 of the module has no issues with managed node groups and existing VPCs.

kevinchiu-mlse · Mar 11 '24

Following up on my comment: I solved the issue where the nodes wouldn't join the cluster in an existing VPC. Reviewing the launch template, I found that the EC2 user data was blank. Then, comparing against the security groups of the old node groups, I saw the new nodes were missing the cluster primary security group.

In the managed node group config, I needed to explicitly set enable_bootstrap_user_data = true and attach_cluster_primary_security_group = true.

  eks_managed_node_groups = {
    # Managed Node groups with minimum config
    group1 = {
      name               = "group1"
      use_name_prefix    = true
      enable_efa_support = false
      ami_type           = "AL2_x86_64"
      ami_id             = data.aws_ami.eks_default.image_id
      cluster_name       = local.name

      enable_bootstrap_user_data            = true
      attach_cluster_primary_security_group = true

      instance_types  = ["m5.xlarge"]
      min_size        = 1
      max_size        = 4
      desired_size    = 1
      create_iam_role = true 
      disk_size       = 50 
      update_config = {
        max_unavailable_percentage = 30
      }
      subnet_ids = data.terraform_remote_state.vpc.outputs.private_subnet_ids

    }
  }

kevinchiu-mlse · Mar 12 '24

Yes, on managed node groups, if you use a custom AMI you must provide the user data, and we make that easier for you via the enable_bootstrap_user_data flag.

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/user_data.md

Also, this would be a minimal config for what you provided; a lot of what you provided is already the default or not required:

eks_managed_node_groups = {
    group1 = {
      ami_type = "AL2_x86_64"
      ami_id   = data.aws_ami.eks_default.image_id

      enable_bootstrap_user_data = true

      instance_types  = ["m5.xlarge"]
      min_size        = 1
      max_size        = 4
      desired_size    = 1
      
      subnet_ids = data.terraform_remote_state.vpc.outputs.private_subnet_ids
    }
}

bryantbiggs · Mar 12 '24

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.

github-actions[bot] · Apr 12 '24

This issue was automatically closed because it had been stale for 10 days.

github-actions[bot] · Apr 22 '24

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] · May 22 '24