terraform-aws-eks

I/O Timeout when trying to create the aws-auth configmap

Open RothAndrew opened this issue 2 years ago • 40 comments

Description

In the latest versions of this module, the aws-auth ConfigMap is managed by the module, but there is no mechanism that waits for the cluster to be ready to accept connections. This leads to I/O timeout errors, as seen in the screenshot below.

What's frustrating is, this doesn't happen all the time. It DOES happen frequently (but not always) in our use case of:

  • Kubernetes public endpoint is disabled
  • Self-managed nodegroups only

The intermittency points to a race condition. Sometimes the EKS cluster has had enough time to become ready by the time Terraform tries to create the aws-auth configmap, and the terraform apply succeeds. Other times Terraform tries to create the configmap before the cluster is ready to accept connections, and the apply fails.

In previous versions of this module we've seen:

  • A null_resource with a local-exec provisioner that uses curl
  • Use of the HTTP provider to wait for the cluster to be ready
  • Management of the aws-auth configmap removed completely
  • Management of the aws-auth configmap brought back, but without any "wait_for_cluster_ready" mechanism

Versions

  • Module version [Required]: v19.11.0

  • Terraform version: v1.3.9

  • Provider version(s):

+ provider registry.terraform.io/hashicorp/aws v4.59.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.18.1
+ provider registry.terraform.io/hashicorp/time v0.9.1
+ provider registry.terraform.io/hashicorp/tls v4.0.4

Reproduction Code [Required]

I haven't had time to create a minimal example. I can go do so if necessary. I'm hoping the issue is self-evident here as this is something that has been a common issue for as long as Terraform has been used to deploy EKS clusters.

Steps to reproduce the behavior:

Deploy this module with the public k8s endpoint disabled and with self-managed nodegroups only. Set both create_aws_auth_configmap and manage_aws_auth_configmap to true.
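A hypothetical configuration matching that description might look like the sketch below. It is not the reporter's actual code: the cluster name, Kubernetes version, the two input variables, and the node group name are placeholders chosen for illustration, while the module arguments are inputs of the v19 module.

```hcl
# Placeholder inputs; supply values from your own VPC.
variable "vpc_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.11.0"

  cluster_name    = "example" # placeholder
  cluster_version = "1.25"    # placeholder

  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  # Private-only API endpoint, as described in the report
  cluster_endpoint_public_access  = false
  cluster_endpoint_private_access = true

  # Self-managed node groups only
  self_managed_node_groups = {
    default = {} # placeholder; relies on the module's defaults
  }

  # Both flags set, so the module creates and manages aws-auth
  create_aws_auth_configmap = true
  manage_aws_auth_configmap = true
}
```

With the public endpoint disabled, Terraform has to run from somewhere that can reach the private endpoint, which is the setup in which the timeout was reported.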

Expected behavior

The aws-auth configmap is always created successfully

Actual behavior

Sometimes a failure occurs due to the cluster not being ready to accept connections yet

Terminal Output Screenshot(s)

[image: screenshot of the terminal output showing the I/O timeout error]

Additional context

We are previous users of https://github.com/aws-ia/terraform-aws-eks-blueprints but, given the new direction that repo is going, we have switched over to using this module directly. However, this race condition breaks things for our use case.

RothAndrew avatar Mar 29 '23 21:03 RothAndrew

It's pretty hacky, but we wrote this up and it seems to work. Still testing. We would still prefer that the module handle this itself rather than backdooring it.

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    # Run the token command through a shell so we can poll the endpoint first.
    command     = "/bin/sh"
    # Poll /healthz up to 30 times, 10 seconds apart, then fetch the token.
    # curl's output is discarded so only the get-token JSON reaches stdout.
    args        = ["-c", "for i in $(seq 1 30); do curl -s -k -f ${module.eks.cluster_endpoint}/healthz > /dev/null && break || sleep 10; done && aws eks --region ${var.region} get-token --cluster-name ${local.cluster_name}"]
  }
}

RothAndrew avatar Mar 30 '23 22:03 RothAndrew

The module does not package providers; users need to supply the provider configuration.

bryantbiggs avatar Mar 30 '23 22:03 bryantbiggs

correct, we have the above code in our root module, next to our module "eks" block

RothAndrew avatar Mar 30 '23 22:03 RothAndrew

So what are you proposing that the module should handle?

bryantbiggs avatar Mar 30 '23 22:03 bryantbiggs

It should check that the EKS cluster is ready to accept connections before it tries to do anything with it. The above code is an ugly hack to get around the fact that the module doesn't check before trying to create the aws-auth configmap.

RothAndrew avatar Mar 30 '23 22:03 RothAndrew

And how would we do that in Terraform?

bryantbiggs avatar Mar 30 '23 22:03 bryantbiggs

This very module used to do it 2 different ways. Most recently it did it using the HTTP Provider: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v17.24.0/data.tf#L92

Or by using a null_resource: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v15.2.0/cluster.tf#L67

EKS Blueprints uses the HTTP provider: https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/8fa9a62d6e08afc0be1467c601796e4ddf73a2b2/data.tf#L10

CloudPosse uses a null_resource: https://github.com/cloudposse/terraform-aws-eks-cluster/blob/fa9667a1f63e5f4140c0e964bbf9e5bee0a43215/auth.tf#L63

None of the options are perfect; they all have one issue or another. But I would certainly argue that doing any of them is better than doing nothing, which leaves users confused and frustrated when their stuff doesn't work and they don't know why.
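As a rough illustration of the null_resource style of wait, a root-module sketch could look like the following. The resource name `wait_for_cluster` and the retry numbers are made up for this example. It assumes `curl` and a POSIX shell are available where Terraform runs, and that the endpoint is reachable from there.

```hcl
terraform {
  required_providers {
    null = {
      source = "hashicorp/null"
    }
  }
}

# Hypothetical resource name; blocks until the API server answers /healthz.
resource "null_resource" "wait_for_cluster" {
  # Re-run the wait if the cluster endpoint changes.
  triggers = {
    endpoint = module.eks.cluster_endpoint
  }

  provisioner "local-exec" {
    interpreter = ["/bin/sh", "-c"]

    environment = {
      ENDPOINT = module.eks.cluster_endpoint
    }

    # Try up to 30 times, 10 seconds apart; fail the apply if never healthy.
    command = <<-EOT
      for i in $(seq 1 30); do
        curl -s -k -f "$ENDPOINT/healthz" > /dev/null && exit 0
        sleep 10
      done
      echo "cluster endpoint did not become healthy in time" >&2
      exit 1
    EOT
  }
}
```

A wait like this only delays resources that declare `depends_on = [null_resource.wait_for_cluster]`. Placed in a root module, it does not by itself gate resources created inside the module, which is one reason to want the check in the module.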

RothAndrew avatar Mar 30 '23 22:03 RothAndrew

Got it. Let's chat about this again in about a month 😬

bryantbiggs avatar Mar 30 '23 22:03 bryantbiggs

it looks like more maintainers are needed 😬

jamengual avatar Mar 31 '23 01:03 jamengual

What makes you say that?

bryantbiggs avatar Mar 31 '23 01:03 bryantbiggs

the month wait you mentioned earlier

jamengual avatar Mar 31 '23 01:03 jamengual

Have optimism, empathy. Perhaps that's not a month of waiting on me, maybe we are waiting for better things that are near

bryantbiggs avatar Mar 31 '23 01:03 bryantbiggs

Can you expand on that? Are you thinking of implementing another solution?

jamengual avatar Mar 31 '23 01:03 jamengual

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

github-actions[bot] avatar May 01 '23 00:05 github-actions[bot]

> Got it. Let's chat about this again in about a month 😬

Any update?

RothAndrew avatar May 01 '23 01:05 RothAndrew

@bryantbiggs Is this what we were waiting for? 🙏 https://github.com/hashicorp/terraform-provider-http/releases/tag/v3.3.0
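For context, a wait built on the `http` data source could be sketched as below. It assumes the `retry` block and `request_timeout_ms` argument that the hashicorp/http provider documents for recent versions. The data source name `wait_for_cluster` and the timing values are illustrative, and nothing here is confirmed to be the change the maintainer had in mind.

```hcl
terraform {
  required_providers {
    http = {
      source  = "hashicorp/http"
      version = ">= 3.3.0"
    }
  }
}

# Hypothetical name; requests /healthz and retries on connection errors.
data "http" "wait_for_cluster" {
  url         = "${module.eks.cluster_endpoint}/healthz"
  ca_cert_pem = base64decode(module.eks.cluster_certificate_authority_data)

  # Give up on a single request after 10 seconds (illustrative value).
  request_timeout_ms = 10000

  # Retry with a delay of 5 to 10 seconds between attempts (illustrative).
  retry {
    attempts     = 30
    min_delay_ms = 5000
    max_delay_ms = 10000
  }
}
```

As with any wait defined outside the module, other resources would need a `depends_on` on this data source for the ordering to take effect.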

zack-is-cool avatar May 01 '23 17:05 zack-is-cool

The timeline has slipped a bit but the changes are coming soon - stay tuned

bryantbiggs avatar May 17 '23 13:05 bryantbiggs

What changes?

RothAndrew avatar May 17 '23 15:05 RothAndrew

just stay tuned - it's proper changes, not hacky changes

bryantbiggs avatar May 17 '23 15:05 bryantbiggs

well I guess it was more than a month.

jamengual avatar May 17 '23 16:05 jamengual

I don't love being kept in the dark. Is there any nugget of insight you can give as to what is coming?

RothAndrew avatar May 17 '23 16:05 RothAndrew

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

github-actions[bot] avatar Jun 17 '23 00:06 github-actions[bot]

Unstale

RothAndrew avatar Jun 17 '23 00:06 RothAndrew

This possibly explains why I am getting the below error when setting create_aws_auth_configmap = true

 "msg": "\nError: Post \"https://xxxxxxxxxxxxx.gr7.eu-west-1.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps\": dial tcp 100.68.84.210:443: i/o timeout\n\n  with module.eks.kubernetes_config_map.aws_auth[0],\n  on .terraform/modules/eks/main.tf line 536, in resource \"kubernetes_config_map\" \"aws_auth\":\n 536: resource \"kubernetes_config_map\" \"aws_auth\" {",
    "rc": 1,

This issue also seems related in some ways to this one:

https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2525#issuecomment-1623720769

dev-travelex avatar Jul 07 '23 14:07 dev-travelex

I just upgraded terraform-aws-eks from 18.x.x to 19.15.3 and am now running into the i/o timeout issue when trying to run kubectl commands (and hitting /api/v1/namespaces/kube-system/configmaps/aws-auth via the public endpoint).

From my perspective, there was an undocumented breaking change made in 19.x.x by changing the default of cluster_endpoint_public_access from true to false.

deasydoesit avatar Jul 11 '23 20:07 deasydoesit

Yeah, in my case, I was able to find a resolution by going into the AWS console, to my EKS cluster, and updating the networking to allow public access to the managed Kubernetes API server.

deasydoesit avatar Jul 11 '23 21:07 deasydoesit

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

github-actions[bot] avatar Aug 11 '23 00:08 github-actions[bot]

Unstale

RothAndrew avatar Aug 11 '23 00:08 RothAndrew

Hitting the same issue. Any update?

Joldnine avatar Aug 29 '23 10:08 Joldnine

this will be resolved once https://github.com/aws/containers-roadmap/issues/185#issuecomment-1569048009 arrives in EKS

bryantbiggs avatar Aug 29 '23 11:08 bryantbiggs