terraform-aws-eks I/O Timeout when trying to create the aws-auth configmap

Description

In the latest versions of this module, the aws-auth ConfigMap is manipulated, but there is no mechanism that waits for the cluster to be ready to accept connections. This leads to I/O Timeout errors as seen in the screenshot below.

What's frustrating is that this doesn't happen every time. It does happen frequently in our use case:

Kubernetes public endpoint is disabled
Self-managed nodegroups only

This points to a race condition. Sometimes the EKS cluster has had enough time to become ready by the time Terraform tries to create the aws-auth configmap, and the terraform apply succeeds. Other times the cluster isn't ready to accept connections yet, and creating the configmap fails.

In previous versions of this module we've seen:

A null_resource with a local_exec that uses curl
Use of the HTTP provider to wait for the cluster to be ready
Management of the aws-auth configmap removed completely
Management of the aws-auth configmap brought back, but without any "wait_for_cluster_ready" mechanism

Versions

Module version [Required]: v19.11.0
Terraform version: v1.3.9
Provider version(s):

+ provider registry.terraform.io/hashicorp/aws v4.59.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.18.1
+ provider registry.terraform.io/hashicorp/time v0.9.1
+ provider registry.terraform.io/hashicorp/tls v4.0.4

Reproduction Code [Required]

I haven't had time to create a minimal example, but I can do so if necessary. I'm hoping the issue is self-evident, as this has been a common problem for as long as Terraform has been used to deploy EKS clusters.

Steps to reproduce the behavior:

Deploy this module with the public k8s endpoint disabled and with self-managed nodegroups only. Set both create_aws_auth_configmap and manage_aws_auth_configmap to true.
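
A minimal sketch of the configuration described in this step, not the reporter's actual code. The cluster name, Kubernetes version, node group settings, and the var.vpc_id and var.private_subnet_ids variables are placeholders assumed for illustration; only the endpoint and aws-auth flags come from the description above.

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.11.0"

  cluster_name    = "example" # hypothetical name
  cluster_version = "1.25"    # hypothetical version

  vpc_id     = var.vpc_id             # hypothetical variable
  subnet_ids = var.private_subnet_ids # hypothetical variable

  # Public endpoint disabled, as in the report.
  cluster_endpoint_public_access  = false
  cluster_endpoint_private_access = true

  # Self-managed node groups only (sizes are placeholders).
  self_managed_node_groups = {
    default = {
      instance_type = "m5.large"
      min_size      = 1
      max_size      = 3
      desired_size  = 1
    }
  }

  # Both flags set to true, as in the report.
  create_aws_auth_configmap = true
  manage_aws_auth_configmap = true
}
```

Because the failure is intermittent, a single apply of a configuration like this may well succeed.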

Expected behavior

The aws-auth configmap is always created successfully

Actual behavior

Sometimes a failure occurs due to the cluster not being ready to accept connections yet

Terminal Output Screenshot(s)

Additional context

We previously used https://github.com/aws-ia/terraform-aws-eks-blueprints, but given the new direction of that repo we have switched to using this module directly. This race condition breaks our use case.

Mar 29 '23 21:03 RothAndrew

It's pretty hacky but we wrote up this and it seems to work. Still testing. Still would prefer for the module to handle this itself rather than backdooring it.

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "/bin/sh"
    # Poll /healthz up to 30 times, 10 seconds apart, then fetch a token.
    # Note: the token is requested even if the endpoint never answers.
    args        = ["-c", "for i in $(seq 1 30); do curl -s -k -f ${module.eks.cluster_endpoint}/healthz > /dev/null && break || sleep 10; done && aws eks --region ${var.region} get-token --cluster-name ${local.cluster_name}"]
  }
}

Mar 30 '23 22:03 RothAndrew

The module does not package providers; users need to supply the provider.

Mar 30 '23 22:03 bryantbiggs

correct, we have the above code in our root module, next to our module "eks" block

Mar 30 '23 22:03 RothAndrew

So what are you proposing that the module should handle?

Mar 30 '23 22:03 bryantbiggs

It should check that the EKS cluster is ready to accept connections before it tries to do anything with it. The above code is an ugly hack to work around the fact that the module doesn't check before trying to create the aws-auth configmap.

Mar 30 '23 22:03 RothAndrew

And how would we do that in Terraform?

Mar 30 '23 22:03 bryantbiggs

This very module used to do it 2 different ways. Most recently it did it using the HTTP Provider: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v17.24.0/data.tf#L92

Or by using a null_resource: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v15.2.0/cluster.tf#L67

EKS Blueprints uses the HTTP provider: https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/8fa9a62d6e08afc0be1467c601796e4ddf73a2b2/data.tf#L10

CloudPosse uses a null_resource: https://github.com/cloudposse/terraform-aws-eks-cluster/blob/fa9667a1f63e5f4140c0e964bbf9e5bee0a43215/auth.tf#L63

None of the options is perfect; each has one issue or another. But I would certainly argue that doing any of them is better than doing nothing, which leaves users confused and frustrated when their stuff doesn't work and they don't know why.
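
For illustration, the null_resource approach could look roughly like the following. This is a sketch under stated assumptions, not the code from any of the linked repositories: the resource name, the retry count and delay, and the ENDPOINT environment variable are all made up for the example, and it assumes curl is installed and the endpoint is reachable from wherever Terraform runs.

```hcl
# Hypothetical sketch: block until the EKS API server answers on /healthz.
resource "null_resource" "wait_for_cluster" {
  # Re-run the check if the endpoint changes.
  triggers = {
    endpoint = module.eks.cluster_endpoint
  }

  provisioner "local-exec" {
    interpreter = ["/bin/sh", "-c"]

    # Try up to 30 times, 10 seconds apart; fail the apply if never healthy.
    command = "for i in $(seq 1 30); do curl -s -k -f \"$ENDPOINT/healthz\" > /dev/null && exit 0; sleep 10; done; echo 'cluster not ready' >&2; exit 1"

    environment = {
      ENDPOINT = module.eks.cluster_endpoint
    }
  }
}
```

Whatever creates the aws-auth configmap would then need depends_on = [null_resource.wait_for_cluster]. That only helps where the configmap resource is under the caller's control, which is why a check inside the module is being requested here.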

Mar 30 '23 22:03 RothAndrew

Got it. Let's chat about this again in about a month 😬

Mar 30 '23 22:03 bryantbiggs

it looks like more maintainers are needed 😬

Mar 31 '23 01:03 jamengual

What makes you say that?

Mar 31 '23 01:03 bryantbiggs

the month wait you mentioned earlier

Mar 31 '23 01:03 jamengual

Have optimism, empathy. Perhaps that's not a month of waiting on me, maybe we are waiting for better things that are near

Mar 31 '23 01:03 bryantbiggs

Can you expand on that? Are you thinking of implementing another solution?

Mar 31 '23 01:03 jamengual

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

May 01 '23 00:05 github-actions[bot]

"Got it. Let's chat about this again in about a month 😬"

Any update?

May 01 '23 01:05 RothAndrew

@bryantbiggs Is this what we were waiting for? 🙏 https://github.com/hashicorp/terraform-provider-http/releases/tag/v3.3.0
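
The thread does not say whether this is the change the maintainer meant. As a hedged sketch of how that release could be used for a readiness check: the retry block and its attempts, min_delay_ms, and max_delay_ms arguments are taken from the http provider's documentation for v3.3.0, while the data source name and the retry values are assumptions for illustration.

```hcl
# Hypothetical sketch: retry GET /healthz until the API server responds.
# Assumes hashicorp/http >= 3.3.0, which added the retry block.
data "http" "wait_for_cluster" {
  url         = "${module.eks.cluster_endpoint}/healthz"
  ca_cert_pem = base64decode(module.eks.cluster_certificate_authority_data)

  retry {
    attempts     = 30    # illustrative value
    min_delay_ms = 10000 # wait at least 10 seconds between attempts
    max_delay_ms = 10000 # and at most 10 seconds
  }
}
```

As with any root-module check, this cannot by itself delay resources created inside the eks module; something must depend on data.http.wait_for_cluster for the ordering to take effect.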

May 01 '23 17:05 zack-is-cool

The timeline has slipped a bit but the changes are coming soon - stay tuned

May 17 '23 13:05 bryantbiggs

What changes?

May 17 '23 15:05 RothAndrew

Just stay tuned - it's proper changes, not hacky changes

May 17 '23 15:05 bryantbiggs

well I guess it was more than a month.

May 17 '23 16:05 jamengual

I don't love being kept in the dark. Is there any nugget of insight you can give as to what is coming?

May 17 '23 16:05 RothAndrew

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

Jun 17 '23 00:06 github-actions[bot]

Unstale

Jun 17 '23 00:06 RothAndrew

This possibly explains why I am getting the below error when setting create_aws_auth_configmap = true

 "msg": "\nError: Post \"https://xxxxxxxxxxxxx.gr7.eu-west-1.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps\": dial tcp 100.68.84.210:443: i/o timeout\n\n  with module.eks.kubernetes_config_map.aws_auth[0],\n  on .terraform/modules/eks/main.tf line 536, in resource \"kubernetes_config_map\" \"aws_auth\":\n 536: resource \"kubernetes_config_map\" \"aws_auth\" {",
    "rc": 1,

This issue also seems related in some ways to this one:

https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2525#issuecomment-1623720769

Jul 07 '23 14:07 dev-travelex

I just upgraded terraform-aws-eks from 18.x.x to 19.15.3 and am now running into the i/o timeout issue when trying to run kubectl commands (and hitting /api/v1/namespaces/kube-system/configmaps/aws-auth via the public endpoint).

From my perspective, there was an undocumented breaking change made in 19.x.x by changing the default of cluster_endpoint_public_access from true to false.

Jul 11 '23 20:07 deasydoesit

Yeah, in my case, I was able to find a resolution by going into the AWS console, to my EKS cluster, and updating the networking to allow public access to the managed Kubernetes API server.
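
The same setting can be expressed in Terraform rather than through the console, so that a later apply does not revert it. A sketch, assuming the module block is named eks; the CIDR range is a documentation placeholder and not a recommendation.

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.15.3"

  # ... other arguments unchanged ...

  # Set explicitly, since the commenter found that the default is false in 19.x.
  cluster_endpoint_public_access = true

  # Optionally restrict who can reach the public endpoint (placeholder range).
  cluster_endpoint_public_access_cidrs = ["203.0.113.0/24"]
}
```

This addresses the case of a cluster that is reachable only privately after the upgrade. It does not address the race condition described in the original report, where the public endpoint is disabled on purpose.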

Jul 11 '23 21:07 deasydoesit

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

Aug 11 '23 00:08 github-actions[bot]

Unstale

Aug 11 '23 00:08 RothAndrew

I'm hitting the same issue. Any update?

Aug 29 '23 10:08 Joldnine

This will be resolved once https://github.com/aws/containers-roadmap/issues/185#issuecomment-1569048009 arrives in EKS

Aug 29 '23 11:08 bryantbiggs
