terraform-aws-eks
I/O Timeout when trying to create the aws-auth configmap
Description
In the latest versions of this module, the aws-auth ConfigMap is manipulated, but there is no mechanism that waits for the cluster to be ready to accept connections. This leads to i/o timeout errors like the one in the screenshot below.
What's frustrating is that this doesn't happen every time. It DOES happen frequently (but not always) in our use case:
- Kubernetes public endpoint is disabled
- Self-managed nodegroups only
This is clear evidence of a race condition. Sometimes the EKS cluster has had enough time to become ready by the time Terraform tries to create the aws-auth configmap, and the terraform apply succeeds. Other times the creation fails because the cluster isn't ready to accept connections yet.
In previous versions of this module we've seen:
- A null_resource with a local-exec that uses curl
- Use of the HTTP provider to wait for the cluster to be ready
- Management of the aws-auth configmap removed completely
- Management of the aws-auth configmap brought back, but without any "wait_for_cluster_ready" mechanism
Versions
- Module version [Required]: v19.11.0
- Terraform version: v1.3.9
- Provider version(s):
+ provider registry.terraform.io/hashicorp/aws v4.59.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.18.1
+ provider registry.terraform.io/hashicorp/time v0.9.1
+ provider registry.terraform.io/hashicorp/tls v4.0.4
Reproduction Code [Required]
I haven't had time to create a minimal example; I can do so if necessary. I'm hoping the issue is self-evident here, as this has been a common problem for as long as Terraform has been used to deploy EKS clusters.
Steps to reproduce the behavior:
Deploy this module with the public k8s endpoint disabled and with self-managed nodegroups only. Set both create_aws_auth_configmap and manage_aws_auth_configmap to true.
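For what it's worth, a minimal sketch of the failing configuration (untested; the cluster name, VPC/subnet IDs, and instance sizing are placeholders):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.11.0"

  cluster_name    = "repro"
  cluster_version = "1.24"

  # Private endpoint only - this is what makes the race visible
  cluster_endpoint_public_access  = false
  cluster_endpoint_private_access = true

  vpc_id     = "vpc-xxxxxxxx"                  # placeholder
  subnet_ids = ["subnet-aaaa", "subnet-bbbb"]  # placeholders

  # Self-managed node groups only
  self_managed_node_groups = {
    default = {
      instance_type = "m5.large"
      min_size      = 1
      max_size      = 3
      desired_size  = 1
    }
  }

  create_aws_auth_configmap = true
  manage_aws_auth_configmap = true
}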
Expected behavior
The aws-auth configmap is always created successfully
Actual behavior
Sometimes a failure occurs due to the cluster not being ready to accept connections yet
Terminal Output Screenshot(s)
(screenshot of the i/o timeout error omitted)
Additional context
We previously used https://github.com/aws-ia/terraform-aws-eks-blueprints, but given the new direction that repo is taking we have switched to using this module directly. Due to this race condition, however, things are now broken for our use case.
It's pretty hacky, but we wrote up the following and it seems to work (still testing). We would still prefer the module to handle this itself rather than having to backdoor it.
provider "kubernetes" {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "/bin/sh"
args = ["-c", "for i in $(seq 1 30); do curl -s -k -f ${module.eks.cluster_endpoint}/healthz > /dev/null && break || sleep 10; done && aws eks --region ${var.region} get-token --cluster-name ${local.cluster_name}"]
}
}
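For what it's worth, this works because the kubernetes provider invokes the exec plugin whenever it needs credentials, so the polling loop (up to 30 attempts, 10 seconds apart, against /healthz with TLS verification skipped) runs before the provider makes its first API call.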
The module does not package providers; users need to supply the provider themselves.
Correct, we have the above code in our root module, next to our module "eks" block.
So what are you proposing that the module should handle?
It should check that the EKS cluster is ready to accept connections before trying to do anything with it. The above code is an ugly hack to work around the fact that the module doesn't check before trying to create the aws-auth configmap.
And how would we do that in Terraform?
This very module used to do it in two different ways. Most recently it used the HTTP provider: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v17.24.0/data.tf#L92
Or by using a null_resource: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v15.2.0/cluster.tf#L67
EKS Blueprints uses the HTTP provider: https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/8fa9a62d6e08afc0be1467c601796e4ddf73a2b2/data.tf#L10
CloudPosse uses a null_resource: https://github.com/cloudposse/terraform-aws-eks-cluster/blob/fa9667a1f63e5f4140c0e964bbf9e5bee0a43215/auth.tf#L63
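Roughly, the null_resource flavor looks like this when written outside the module (a sketch modeled on the old v15 implementation; the retry count and sleep interval are illustrative, not the module's actual defaults):

resource "null_resource" "wait_for_cluster" {
  # Re-run the check whenever the cluster endpoint changes
  triggers = {
    endpoint = module.eks.cluster_endpoint
  }

  provisioner "local-exec" {
    interpreter = ["/bin/sh", "-c"]
    environment = {
      ENDPOINT = module.eks.cluster_endpoint
    }
    # Poll /healthz until the API server answers, or give up after ~5 minutes
    command = <<-EOT
      for i in $(seq 1 60); do
        curl -s -k -f "$ENDPOINT/healthz" > /dev/null && exit 0
        sleep 5
      done
      echo "timed out waiting for cluster" && exit 1
    EOT
  }
}

The catch is that anything depending on the wait needs an explicit depends_on = [null_resource.wait_for_cluster], and the aws-auth configmap resource lives inside the module, so only the module itself can wire that up.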
None of the options is perfect; they all have one issue or another. But I would certainly argue that doing any of them is better than doing nothing, which leaves users confused and frustrated when their deployments fail and they don't know why.
Got it. Let's chat about this again in about a month 😬
it looks like more maintainers are needed 😬
What makes you say that?
the month wait you mentioned earlier
Have optimism, empathy. Perhaps that's not a month of waiting on me, maybe we are waiting for better things that are near
can you expand on that? are you thinking of implementing another solution?
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days
Got it. Let's chat about this again in about a month 😬
Any update?
@bryantbiggs Is this what we were waiting for? https://github.com/hashicorp/terraform-provider-http/releases/tag/v3.3.0
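If so, something like this should finally be possible with the plain hashicorp/http provider (a sketch assuming the new retry block retries connection failures; attribute names per the 3.3.0 docs):

data "http" "wait_for_cluster" {
  url         = "${module.eks.cluster_endpoint}/healthz"
  ca_cert_pem = base64decode(module.eks.cluster_certificate_authority_data)

  # New in v3.3.0: retry for up to ~5 minutes before failing the read
  retry {
    attempts     = 30
    min_delay_ms = 10000
  }

  lifecycle {
    postcondition {
      condition     = self.status_code == 200
      error_message = "EKS API server never became healthy"
    }
  }
}

The module's aws-auth resources would then hang a depends_on off this data source.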
The timeline has slipped a bit but the changes are coming soon - stay tuned
What changes?
just stay tuned - it's proper changes, not hacky changes
Well, I guess it was more than a month. I don't love being kept in the dark. Is there any nugget of insight you can give as to what is coming?
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days
Unstale
This possibly explains why I am getting the below error when setting create_aws_auth_configmap = true
"msg": "\nError: Post \"https://xxxxxxxxxxxxx.gr7.eu-west-1.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps\": dial tcp 100.68.84.210:443: i/o timeout\n\n with module.eks.kubernetes_config_map.aws_auth[0],\n on .terraform/modules/eks/main.tf line 536, in resource \"kubernetes_config_map\" \"aws_auth\":\n 536: resource \"kubernetes_config_map\" \"aws_auth\" {",
"rc": 1,
Seems like this issue is in some ways related to this one too:
https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2525#issuecomment-1623720769
I just upgraded terraform-aws-eks from 18.x.x to 19.15.3 and am now running into the i/o timeout issue when trying to run kubectl commands (and hitting /api/v1/namespaces/kube-system/configmaps/aws-auth via the public endpoint).
From my perspective, there was an undocumented breaking change in 19.x.x: the default of cluster_endpoint_public_access changed from true to false.
Yeah, in my case, I was able to find a resolution by going into the AWS console, to my EKS cluster, and updating the networking to allow public access to the managed Kubernetes API server.
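For anyone who would rather keep that fix in code, the console change corresponds to a single module input (sketch against the v19 inputs):

module "eks" {
  # ...

  # v19 flipped this default to false; set it explicitly to restore the old behavior
  cluster_endpoint_public_access = true
}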
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days
Unstale
Same issue here. Any update?
this will be resolved once https://github.com/aws/containers-roadmap/issues/185#issuecomment-1569048009 arrives in EKS