terraform-aws-eks
After update to v20 with API_AND_CONFIG_MAP cluster cannot launch Fargate pods
Description
After updating the cluster from v19 to v20 and switching to the API_AND_CONFIG_MAP
auth mode, the cluster cannot launch new Fargate pods.
New clusters created with API_AND_CONFIG_MAP
mode cannot launch Fargate pods either.
Versions
- Module version [Required]: 20.2.0
- Terraform version:
Terraform v1.5.7
on darwin_arm64
- Provider version(s):
+ provider registry.terraform.io/hashicorp/aws v5.35.0
+ provider registry.terraform.io/hashicorp/cloudinit v2.3.3
+ provider registry.terraform.io/hashicorp/kubernetes v2.25.2
+ provider registry.terraform.io/hashicorp/time v0.10.0
+ provider registry.terraform.io/hashicorp/tls v4.0.5
Reproduction Code [Required]
Basic stripped-down version of what we're using:
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "20.2.0"
cluster_name = "test"
cluster_version = "1.28"
vpc_id = <vpc_id>
subnet_ids = <subnet_ids>
enable_irsa = true
cluster_endpoint_private_access = true
cluster_endpoint_public_access = false
iam_role_use_name_prefix = true
enable_cluster_creator_admin_permissions = true
fargate_profiles = {
kube-system = {
name = "kube-system"
selectors = [
{ namespace = "kube-system" }
]
}
}
}
Expected behavior
Fargate pods should be able to launch
Actual behavior
After updating to v20, we're seeing the following errors when trying to launch new pods:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 32m fargate-scheduler Misconfigured Fargate Profile: fargate profile kube-system blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods.
Warning FailedScheduling 27m fargate-scheduler Misconfigured Fargate Profile: fargate profile kube-system blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods.
Additional context
From what I see, there is an access entry created for Fargate, but no aws-auth ConfigMap entry. While that's probably expected, maybe it affects the ability to run Fargate?
what steps did you follow when upgrading from v19 to v20?
Hey, we saw this when upgrading our clusters to use the EKS API. We re-created all of our Karpenter Fargate profiles and this solved the issue for us.
Might be worth a try, as recreating the Fargate profile will not cause a loss of nodes (only a short window without autoscaling).
what steps did you follow when upgrading from v19 to v20?
@bryantbiggs doesn't really matter tbh, since the issue reproduces on a new cluster created from scratch with the v20 module
Might be worth a try, as recreating the Fargate profile will not cause a loss of nodes (only a short window without autoscaling).
Tried this - after the Fargate profile recreation, entries were added to the aws-auth
ConfigMap by EKS itself and Fargate started working. But the next apply shows a diff to remove these entries from the ConfigMap:
# module.cluster.module.auth.kubernetes_config_map_v1_data.aws_auth[0] will be updated in-place
~ resource "kubernetes_config_map_v1_data" "aws_auth" {
~ data = {
~ "mapRoles" = <<-EOT
- - groups:
- - system:bootstrappers
- - system:nodes
- rolearn: arn:aws:iam::1111111111111:role/Karpenter-test
- username: system:node:{{EC2PrivateDNSName}}
- - groups:
- - system:bootstrappers
- - system:nodes
- - system:node-proxier
- rolearn: arn:aws:iam::1111111111111:role/kube-system-20240208073339112200000002
- username: system:node:{{SessionName}}
+ - "groups":
+ - "system:bootstrappers"
+ - "system:nodes"
+ "rolearn": "arn:aws:iam::1111111111111:role/Karpenter-test"
+ "username": "system:node:{{EC2PrivateDNSName}}"
EOT
# (2 unchanged elements hidden)
}
The Fargate profile works so far (after removing the entries from the aws-auth
ConfigMap), but I'm not sure if that's a permanent solution.
Update: some time after deleting the entries from the ConfigMap, the Fargate profile stopped working again:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7s fargate-scheduler Misconfigured Fargate Profile: fargate profile kube-system blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods.
@bryantbiggs doesn't really matter tbh, since the issue reproduces on a new cluster created from scratch with the v20 module
@dmitriishaburov are you saying that when you create a brand new cluster using the latest v20 module and EKS Fargate profiles with authentication_mode = "API_AND_CONFIG_MAP"
, the pods running on Fargate nodes fail to launch, or launch but then later fail?
@dmitriishaburov are you saying that when you create a brand new cluster using the latest v20 module and EKS Fargate profiles with
authentication_mode = "API_AND_CONFIG_MAP"
, the pods running on Fargate nodes fail to launch, or launch but then later fail?
Yes, after creating a brand new cluster with the v20 module and Fargate, pods initially launch but later fail (existing pods keep running, but no new pods can be launched)
Ok thank you - let me dig into this
@dmitriishaburov do you have a way to reproduce? I launched the Fargate example that we have in this module and scaled the sample deployment, and I'm still not seeing any issues so far:
k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default inflate-75d744d4c6-67r5k 1/1 Running 0 11m
default inflate-75d744d4c6-8mn7d 1/1 Running 0 11m
default inflate-75d744d4c6-mlgq7 1/1 Running 0 11m
default inflate-75d744d4c6-nf6lm 1/1 Running 0 11m
default inflate-75d744d4c6-pd8rc 1/1 Running 0 11m
karpenter karpenter-7b9d64546f-96jdn 1/1 Running 0 17m
karpenter karpenter-7b9d64546f-dn6kg 1/1 Running 0 17m
kube-system aws-node-qlv67 2/2 Running 0 11m
kube-system coredns-644f96d56d-5lwzv 1/1 Running 0 22m
kube-system coredns-644f96d56d-tw87p 1/1 Running 0 22m
kube-system kube-proxy-tdmhd 1/1 Running 0 11m
@bryantbiggs have you checked that the aws-auth ConfigMap doesn't have entries for Fargate?
If there are no ConfigMap entries, I'd try restarting any deployment in ~1 hour or so, e.g. kubectl rollout restart deploy coredns -n kube-system
yes, there are configmap entries - these are created by EKS
k get configmap -n kube-system aws-auth -o yaml
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::111111111111:role/kube-system-20240208133840563900000002
      username: system:node:{{SessionName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::111111111111:role/karpenter-20240208133840563500000001
      username: system:node:{{SessionName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2024-02-08T13:49:11Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "1442"
  uid: 990a01cc-c9cb-4e5a-a0b5-e278ebfdefce
I've manually deleted the aws-auth
ConfigMap and restarted both the CoreDNS and Karpenter deployments, and still no signs of auth issues
still no signs of auth issues after an hour. For now I am going to park this - I don't think there is anything module-related since I am unable to reproduce
Yeah, seems like it's quite hard to replicate.
I've created one more cluster to replicate it, keeping the configuration as small as possible, and was trying to restart coredns.
It took around 1.5 hours for the Fargate profile to start failing:
First try: Fri Feb 9 10:39:07 EET 2024
Failed to start: Fri Feb 9 11:59:38 EET 2024
Here's the entire Terraform code for the cluster:
rovider "aws" {
profile = "profile"
region = "eu-central-1"
allowed_account_ids = ["111111111"]
}
data "aws_eks_cluster_auth" "this" {
name = module.eks.cluster_name
}
provider "kubernetes" {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = data.aws_eks_cluster_auth.this.token
}
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "20.2.0"
cluster_name = "fargate-test"
cluster_version = "1.28"
vpc_id = "vpc-111111111"
subnet_ids = [
"subnet-111111111",
"subnet-111111112",
"subnet-111111113",
]
cluster_encryption_config = {}
cluster_endpoint_private_access = true
cluster_endpoint_public_access = false
enable_cluster_creator_admin_permissions = true
cluster_security_group_additional_rules = {
vpn_access = {
description = "VPN"
protocol = "tcp"
from_port = 443
to_port = 443
cidr_blocks = [
"192.168.0.0/19",
]
type = "ingress"
}
}
fargate_profiles = {
kube-system = {
name = "kube-system"
selectors = [
{ namespace = "kube-system" }
]
}
}
}
module "auth" {
source = "terraform-aws-modules/eks/aws//modules/aws-auth"
version = "20.2.0"
manage_aws_auth_configmap = true
aws_auth_roles = [
{
username = "SomeIAMRole"
rolearn = "arn:aws:iam::111111111:role/SomeIAMRole"
groups = ["system:masters"]
}
]
}
@dmitriishaburov The Fargate pod execution roles used to be added via the aws-auth template file in v19. In v20 they're not there anymore, and you need to pass them forward yourself.
So moving from v19 -> v20 you need to add those to the roles mapping:
module "aws-auth" {
source = "terraform-aws-modules/eks/aws//modules/aws-auth"
version = "~> 20.0"
# aws-auth configmap
create_aws_auth_configmap = false
manage_aws_auth_configmap = true
aws_auth_roles = concat(local.roles, local.nodegroup_roles)
aws_auth_users = concat(local.cluster_users, local.users, local.tf_user)
}
locals {
roles = try([
{
rolearn = module.eks_blueprints_addons[0].karpenter.node_iam_role_arn
username = "system:node:{{EC2PrivateDNSName}}"
groups = [
"system:bootstrappers",
"system:nodes"
]
},
{
rolearn = module.eks[0].module.fargate_profile["karpenter"].fargate_profile_pod_execution_role_arn
username = "system:node:{{SessionName}}"
groups = [
"system:bootstrappers",
"system:nodes",
"system:node-proxier"
]
}
], []
cluster_users = try([
for arn in var.cluster_users :
{
userarn = arn
username = regex("[a-zA-Z0-9-_]+$", arn)
groups = [
"system:masters"
]
}
], [])
}
@bryantbiggs I see you posted the aws-auth ConfigMap that was recreated after being deleted - could you paste the TF code you are using to create that with the aws-auth module? I'm assuming you are passing it the fargate_profiles similar to what I am, but probably doing it with a for_each (which would be better). We probably need to document this change a bit more. Besides fargate_profiles, the old aws-auth template also did the same for node groups, so that would be more to add here, and it should be documented somewhere what needs to be added when moving from v19 -> v20 to mimic the exact same ConfigMap the old template was producing automatically.
The Fargate pod execution roles used to be added via the aws-auth template file in v19. In v20 they're not there anymore, and you need to pass them forward yourself.
This is not true - EKS will create both the aws-auth ConfigMap entry and the cluster access entry when using authentication_mode = "API_AND_CONFIG_MAP". However, if you are making any changes to the aws-auth ConfigMap, it's up to you to ensure any entries you require stay in the ConfigMap via the configuration you use. With "API_AND_CONFIG_MAP", you do not need to have the Fargate profile's IAM roles added in the ConfigMap because EKS will ensure you have an access entry, and this is controlled outside of Terraform.
We probably need to document this change a bit more. Besides fargate_profiles, the old aws-auth template also did the same for node groups, so that would be more to add here, and it should be documented somewhere what needs to be added when moving from v19 -> v20 to mimic the exact same ConfigMap the old template was producing automatically.
It is documented - I created an entire replica of the module to make this transition easier: https://github.com/clowdhaus/terraform-aws-eks-migrate-v19-to-v20
Unless users are using authentication_mode = "CONFIG_MAP", there are no actions users need to take with EKS Fargate profiles and managed node groups (once they have migrated to v20, or if they are provisioning new clusters with v20)
you do not need to have the Fargate profile's IAM roles added in the ConfigMap because EKS
Docs are not entirely clear, but it seems like during the migration to access entries you shouldn't actually remove the Fargate (or managed node group) entries from the ConfigMap:
https://docs.aws.amazon.com/eks/latest/userguide/migrating-access-entries.html
In v19 the ConfigMap entries were created automatically by Terraform; in v20 any change to the ConfigMap via Terraform removes the AWS-created entries from it. It would probably make sense to keep the behavior the same in the aws-auth module.
If you remove entries that Amazon EKS created in the ConfigMap, your cluster won't function properly.
we cannot maintain the same functionality because that means we are keeping the Kubernetes provider in the module, which we are absolutely not doing
In v19 the ConfigMap entries were created automatically by Terraform; in v20 any change to the ConfigMap via Terraform removes the AWS-created entries from it. It would probably make sense to keep the behavior the same in the aws-auth module.
This is not true. You need to understand how EKS handles access, as I've stated above.
- Prior to access entry, when you create a managed nodegroup or Fargate profile, EKS will automatically upsert an entry into the aws-auth ConfigMap with the IAM role that is used by either compute construct.
- With access entry, when you create a managed nodegroup or Fargate profile, if you set authentication_mode = "API_AND_CONFIG_MAP", EKS will again automatically upsert an entry into the aws-auth ConfigMap with the IAM role that is used by either compute construct AND create an access entry for the IAM role.
- If using access entry and authentication_mode = "API", EKS will automatically create an access entry for the IAM role for managed nodegroup(s) and Fargate profile(s).
That's just the EKS portion - that's the behavior of the EKS API, both past and present.
In terms of this module, the aws-auth ConfigMap was a bit contentious because Terraform does not like sharing ownership of resources (actually, it doesn't share at all). Ignoring this module, if you defined a kubernetes_config_map resource, it was very easy for users to overwrite the contents that already existed in the ConfigMap (i.e. the entries that EKS added for managed nodegroups and Fargate profiles), and only after dealing with the "resource already exists" conflict error. This was so problematic that Hashicorp created a resource that is somewhat abnormal and outside the normal Terraform philosophy, kubernetes_config_map_v1_data, to allow users to forcefully overwrite a ConfigMap. That lets users avoid the "resource already exists" errors, but again you still had the issue of wiping the entries that EKS added.
Coming back to this module, we automatically mapped the roles from both managed nodegroups and Fargate profiles created by this module into aws-auth ConfigMap entries to ensure users didn't shoot themselves in the foot and remove the entries that EKS added. To users this was transparent - it seemed like the module was the creator of these entries, but as you can see, it's a bit more nuanced.
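For illustration only, that standalone resource looks roughly like this - a minimal sketch with a placeholder role, not this module's actual implementation:

resource "kubernetes_config_map_v1_data" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    # yamlencode() renders the same structure EKS expects under mapRoles
    mapRoles = yamlencode([
      {
        rolearn  = "arn:aws:iam::111111111111:role/SomeIAMRole" # placeholder
        username = "SomeIAMRole"
        groups   = ["system:masters"]
      }
    ])
  }

  # take ownership of these keys even if something else (e.g. EKS) already wrote them -
  # which is exactly how the EKS-created entries end up being wiped
  force = true
}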
Finally, we come to the migration from the aws-auth ConfigMap to cluster access entry. We will use the following scenario to better highlight the various components, steps, and interactions:
- Starting with an EKS module using v19 which has manage_aws_auth_configmap = true
- There is a self-managed nodegroup, EKS managed nodegroup, and Fargate profile in this cluster definition
- There is an extra entry in the aws-auth ConfigMap for an IAM role or user

1. First, we will use the steps in the upgrade guide and start off by changing the source of the module from source = "terraform-aws-modules/eks/aws" to source = "git@github.com:clowdhaus/terraform-aws-eks-v20-migrate.git?ref=c356ac8ec211604defaaaad49d27863d1e8a1391" (remove the version for now since we are using a specific git SHA for this temporary step). This temporary module, used to aid in upgrading, will allow us to enable cluster access entry without modifying the aws-auth ConfigMap (a sketch of this source change follows after these steps).
ConfigMap` - Once we've changed the source we'll do the usual Terraform commands:
-
terraform init -upgrade
-
terraform plan
- check that everything looks kosher, we should see theauthentication_mode = "API_AND_CONFIG_MAP"
- consult the upgrade guide for any other changes that show up in the diff and make changes accordingly (should be quite minimal, only defaults for v20 that are changing or new additions) -
terraform apply
- accept the apply
What is happening in step 2 is that we are enabling cluster access entry but not modifying the aws-auth
ConfigMapas stated in the EKS docs. If you do not specify any additional
access_entries`, this will only cover the self-managed nodegroup, the EKS managed nodegroup and Fargate profile IAM roles and the cluster creator (admin) role. In the background, EKS is creating the access entries for the managed nodegroup and Fargate profile roles, as well as the cluster creator (admin) role. The EKS module is creating the access entry for the self-managed nodegroup.
3. The last component we need to cover is the additional aws-auth ConfigMap entry for the IAM role or user. If you require custom RBAC permissions, you will need to continue using the ConfigMap route via the new aws-auth sub-module. This sub-module is a direct copy of the v19 implementation, but it no longer has any default entries for nodegroups or Fargate profiles - only what users specify. If you can use one of the existing access policies, you can instead create an access entry for this IAM role or user and completely remove the use of the aws-auth ConfigMap (see the sketch at the end of this section). For this scenario, we will only use access entries.
4. Change the module source back to source = "terraform-aws-modules/eks/aws", set the appropriate v20 version, and re-run the same set of commands listed in step 2. When this change is applied, the aws-auth ConfigMap would be deleted from the cluster by Terraform. This is fine and expected - this is why we need to ensure access entries exist prior to this happening (or even prior to entries being removed from the ConfigMap).
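To make step 1 concrete, the temporary source change is roughly this (a sketch - all of your existing cluster arguments stay exactly as they are):

module "eks" {
  # temporarily point at the migration fork, pinned to the git SHA from the upgrade guide;
  # the version argument is removed while a git source is in use
  source = "git@github.com:clowdhaus/terraform-aws-eks-v20-migrate.git?ref=c356ac8ec211604defaaaad49d27863d1e8a1391"

  cluster_name    = "test"
  cluster_version = "1.28"
  # ... remaining arguments unchanged
}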
Just for the sake of completeness - if the authentication_mode stays at "API_AND_CONFIG_MAP" (which is fine), then for any changes to the IAM roles for the managed nodegroup(s) or Fargate profile(s) in the cluster (updates, additions, etc.), EKS will continue to automatically upsert entries into the aws-auth ConfigMap. In the scenario above, you saw that the ConfigMap was entirely removed from the cluster - but any of the described changes will cause the aws-auth ConfigMap to be re-created by EKS. If you want to avoid this entirely, you can change the authentication_mode to "API"; then only access entries will be used, and EKS will no longer make any modifications to the aws-auth ConfigMap.
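For the scenario's extra IAM role, a rough equivalent that drops the ConfigMap entirely could look like this (a minimal sketch using the module's access_entries input; AmazonEKSClusterAdminPolicy stands in for the old system:masters mapping, and the role ARN is the placeholder from this thread):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  # "API" = access entries only, EKS stops touching aws-auth; "API_AND_CONFIG_MAP" keeps both
  authentication_mode = "API"

  access_entries = {
    some_iam_role = {
      principal_arn = "arn:aws:iam::111111111:role/SomeIAMRole"

      policy_associations = {
        admin = {
          policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
          access_scope = {
            type = "cluster"
          }
        }
      }
    }
  }

  # ... remaining cluster configuration unchanged
}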
I'm running into the same issues, and reading all of this doesn't clear up what is happening.
I've run the migration from v19 to v20 using the migration fork and the Fargate pods are starting correctly. However, now that I'm back on the standard eks module source with the version set to ~> 20.0, a terraform plan is saying it would like to destroy the module.eks.kubernetes_config_map_v1_data.aws_auth[0] resource. I did see in the 20.x upgrade documentation that instead of letting Terraform destroy that resource, it should be removed from the state so no disruptions occur. In my case it wants to remove the karpenter and the initial-eks-node-group roles. The 20.x upgrade documentation says it automatically adds access for managed node groups and Fargate, and Karpenter is running on Fargate. If that statement is true, why do we need to leave the resources that were created by module.eks.kubernetes_config_map_v1_data.aws_auth[0] around?
Any help clarifying this would be greatly appreciated.
If that statement is true why do we need to leave the resources that were created by module.eks.kubernetes_config_map_v1_data.aws_auth[0] around?
You only need to move/remove the aws-auth resources when you are going to have entries that are used in the aws-auth ConfigMap. If everything is covered by cluster access entries, you do not need to do anything with these resources and can simply let Terraform destroy them.
I faced the same challenge as well and identified the issue. As per the AWS docs, we need to have the following trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": "arn:aws:eks:region-code:111122223333:fargateprofile/my-cluster/*"
        }
      },
      "Principal": {
        "Service": "eks-fargate-pods.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
whereas the module is creating only:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks-fargate-pods.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Once I added the additional trust condition, things started working. I am not sure why it works on v19 though.
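For anyone managing the pod execution role themselves, that trust policy could be expressed in Terraform roughly like this (a sketch; the region, account ID, and cluster name are the placeholders from the AWS docs snippet above, and whether the condition is actually required is exactly what's in question here):

data "aws_iam_policy_document" "fargate_pod_execution_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["eks-fargate-pods.amazonaws.com"]
    }

    # scope the trust to Fargate profiles of this specific cluster, per the AWS docs excerpt above
    condition {
      test     = "ArnLike"
      variable = "aws:SourceArn"
      values   = ["arn:aws:eks:region-code:111122223333:fargateprofile/my-cluster/*"]
    }
  }
}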
Hey @bryantbiggs, I'm observing a similar thing and would like some clarification before I make the upgrade, to avoid any disruption in access.
When I go from v19 to v20, setting the version to ~> 20.0 and creating the auth roles using the sub-module as below:
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "20.2.0"
....
}
module "auth" {
source = "terraform-aws-modules/eks/aws//modules/aws-auth"
version = "20.2.0"
manage_aws_auth_configmap = true
aws_auth_roles = [
{
username = "SomeIAMRole"
rolearn = "arn:aws:iam::111111111:role/SomeIAMRole"
groups = ["system:masters"]
}
]
}
the terraform plan shows destruction & creation of the ConfigMap:
# module.eks.kubernetes_config_map_v1_data.aws_auth[0] will be destroyed
# (because kubernetes_config_map_v1_data.aws_auth is not in configuration)
# module.eks-auth-modules.kubernetes_config_map_v1_data.aws_auth[0] will be created
My concern with this delete & create: will I lose access for some time, or will the EKS access entry (which I'm assuming will be created automatically) take care of the disruption? Can this create a shoot-yourself-in-the-foot scenario?
And is it for this reason that we need to go with this approach https://github.com/clowdhaus/terraform-aws-eks-migrate-v19-to-v20 rather than a direct upgrade from v19 to v20?
This is tough to reproduce, but I ran into it as well, in API_AND_CONFIG_MAP
mode. While there is an access entry being created for the Fargate profiles, it appears to be missing something - not sure what. I had to re-add the entries to the aws-auth ConfigMap to keep my Fargate profiles working.
I also ran into the same issue and the same error message.
Steps I followed:
- eks module version 19.17.1
- Upgraded to 20.4.0
- I simply upgraded the version. In my terraform plan I also got the same module.eks.kubernetes_config_map_v1_data.aws_auth[0] will be destroyed, and immediately after the upgrade a rollout command to deploy a pod on Fargate was working, but after an hour it didn't.
- Note: I didn't specify authentication_mode = "API_AND_CONFIG_MAP". I simply chose to keep the default.
that is far from what is outlined in the upgrade guide, and I would expect issues when following that route
I just brought up a brand new EKS cluster on 20.5.0 and it's having the same issue:
Warning FailedScheduling 21s fargate-scheduler Misconfigured Fargate Profile: fargate profile coredns blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods.
so this has nothing to do with the upgrade
module "eks" {
...
enable_cluster_creator_admin_permissions = true
fargate_profile_defaults = {
iam_role_additional_policies = {
additional = aws_iam_policy.node_additional.arn,
}
tags = {
cluster = local.name
}
timeouts = {
create = "20m"
delete = "20m"
}
}
fargate_profiles = {
karpenter = {
selectors = [
{ namespace = "platform-karpenter" }
]
}
coredns = {
selectors = [
{ namespace = "kube-system", labels = { k8s-app = "kube-dns" } }
]
}
}
}
# bit more in-depth policy than the one in the fargate example:
resource "aws_iam_policy" "node_additional" {
  name        = "${local.name}-additional"
  description = "Example usage of node additional policy"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "ec2:Describe*",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
      {
        Action = [
          "kms:Decrypt",
        ]
        Effect = "Allow"
        Resource = [
          var.session_manager_key
        ]
      },
      {
        Action = [
          "kms:*"
        ]
        Effect   = "Allow"
        Resource = ["*"]
        Condition = {
          StringLike = {
            "ec2:ResourceTag/Terraform" = "true"
          }
        }
      }
    ]
  })

  tags = local.tags
}
I looked at the pod execution role for coredns and it has "AmazonEKS_CNI_Policy", "AmazonEKSFargatePodExecutionRolePolicy", and "My additional role policy" attached, with a trust policy of:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks-fargate-pods.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
So it looks correct, I think. I know @kuntalkumarbasu said above that you need to add the condition to make it work, but that doesn't seem correct, does it?
module "aws-auth" {
source = "terraform-aws-modules/eks/aws//modules/aws-auth"
version = "~> 20.0"
# aws-auth configmap
create_aws_auth_configmap = false
manage_aws_auth_configmap = true
# local.cluster_users is list of arns, local.users is AWS account list of arns, local.tf_user is the role arn creating the terraform apply to add as system:masters - these are to be replaced with access_entries (tf user should already be done by the eks module now).
aws_auth_users = concat(local.cluster_users, local.users, local.tf_user)
}
I used to add the Fargate pod execution role ARNs here as described in a previous reply, but since the module creates access entries I removed those from aws_auth_roles.
As @bryantbiggs mentioned, I followed these steps and I am not seeing this error anymore:
- Followed https://github.com/clowdhaus/terraform-aws-eks-migrate-v19-to-v20
- Updated the module to 19.21
- Changed the source
- Refactored the code
- terraform init -upgrade and apply
- Changed the source again
- terraform init -upgrade and apply
- Changed the authentication_mode = API
- terraform apply
Confirmed using kubectl rollout after 2 hours and 24 hours - EKS is able to deploy pods on Fargate nodes.
Team, I'm facing the same issue after migrating from v19 to v20. I get the following error (same as OP) on my pending Karpenter pods:
Misconfigured Fargate Profile: fargate profile karpenter blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods
What are some things I could try?
eks-cluster
module "eks_cluster" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.8.2"
create_kms_key = true
kms_key_owners = [
"arn:aws:iam::${local.account_id}:root",
]
kms_key_administrators = [
"arn:aws:iam::${local.account_id}:role/aws-reserved/sso.amazonaws.com/<role>
]
cluster_enabled_log_types = [
"api",
"authenticator"
]
enable_irsa = true
authentication_mode = "API_AND_CONFIG_MAP"
cluster_name = local.name
cluster_version = "1.29"
cluster_endpoint_public_access = true
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
# control_plane_subnet_ids = module.vpc.intra_subnets
# Fargate profiles use the cluster primary security group so these are not utilized
create_cluster_security_group = false
create_node_security_group = false
fargate_profiles = {
karpenter = {
selectors = [
{ namespace = "karpenter" }
]
}
}
tags = merge(local.tags, {
"karpenter.sh/discovery" = local.name
})
node_security_group_additional_rules = {
ingress_self_all = { ... }
egress_all = { ... }
ingress_cluster_to_node_all_traffic = { ... }
}
}
aws-auth
module "eks_cluster_aws_auth" {
source = "terraform-aws-modules/eks/aws//modules/aws-auth"
version = "~> 20.8.2"
manage_aws_auth_configmap = true
aws_auth_roles = flatten([
# We need to add in the Karpenter node IAM role for nodes launched by Karpenter
{
rolearn = module.eks_blueprints_addons.karpenter.node_iam_role_arn
username = "system:node:{{SessionName}}"
groups = [
"system:bootstrappers",
"system:nodes",
"system:node-proxier"
]
},
module.platform.aws_auth_configmap_role,
module.peo.aws_auth_configmap_role,
module.ats.aws_auth_configmap_role,
module.hris_relay.aws_auth_configmap_role,
module.pipelines.aws_auth_configmap_role,
])
}
karpenter
module "eks_cluster_karpenter" {
source = "terraform-aws-modules/eks/aws//modules/karpenter"
version = "~> 20.8.2"
cluster_name = module.eks_cluster.cluster_name
create_access_entry = false
enable_irsa = true
create_instance_profile = true
iam_role_name = "KarpenterIRSA-${module.eks_cluster.cluster_name}"
iam_role_description = "Karpenter IAM role for service account"
iam_policy_name = "KarpenterIRSA-${module.eks_cluster.cluster_name}"
iam_policy_description = "Karpenter IAM role for service account"
irsa_oidc_provider_arn = module.eks_cluster.oidc_provider_arn
tags = merge(local.tags, {})
}
I had the same issue:
Misconfigured Fargate Profile: fargate profile karpenter blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods
The only thing that solved it for me was manually adding the Fargate pod execution role to aws-auth (using the new sub-module) like this:
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.2.1"
cluster_name = "my-cluster"
cluster_version = "1.28"
# Fargate profiles use the cluster primary security group so these are not utilized
create_cluster_security_group = false
create_node_security_group = false
fargate_profiles = {
karpenter = {
selectors = [
{ namespace = "karpenter" }
]
}
kube-system = {
selectors = [
{ namespace = "kube-system" }
]
}
}
}
module "eks_auth" {
source = "terraform-aws-modules/eks/aws//modules/aws-auth"
version = "~> 20.2.1"
manage_aws_auth_configmap = true
aws_auth_roles = [
{
rolearn = module.karpenter.node_iam_role_arn
username = "system:node:{{EC2PrivateDNSName}}"
groups = [
"system:bootstrappers",
"system:nodes",
]
},
{
rolearn = module.eks.fargate_profiles.kube-system.fargate_profile_pod_execution_role_arn
username = "system:node:{{SessionName}}"
groups = [
"system:bootstrappers",
"system:nodes",
"system:node-proxier",
]
},
{
rolearn = module.eks.fargate_profiles.karpenter.fargate_profile_pod_execution_role_arn
username = "system:node:{{EC2PrivateDNSName}}"
groups = [
"system:bootstrappers",
"system:nodes",
"system:node-proxier",
]
},
]
}
For some reason, it only worked when using EC2PrivateDNSName
as the username and making sure to also add the system:node-proxier
group, though I don't fully understand why.
For those on this issue/thread, can you open an AWS support case with your cluster ARN and the time period when you encountered this behavior, please?
We have encountered this issue on all of our ~12 clusters. It is definitely an EKS issue and not a terraform issue since deleting and recreating the fargate profile (either via terraform or the console) fixes it... temporarily. We've opened an AWS ticket for the matter.