bug: IAM Controller keeps re-reconciling Role
Describe the bug
Similar to #1939, but here the issue occurs under all conditions, even when inlinePolicies is set.
This is critical for our CI/CD: we monitor the status of the CR as the success criterion, and it never succeeds!
Steps to reproduce
- create a kind: Role
- enable debug log
- monitor the status.conditions[*].status of the CR, which will stay False forever
Expected outcome
The status field to eventually turn True
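The success criterion described above (every entry in status.conditions eventually reporting "True") can be sketched as a small check. This is a minimal sketch, assuming the CR has already been fetched and parsed into a dict (for example from JSON output); the function names `unsynced_conditions` and `is_ready` are hypothetical and not part of ACK.

```python
from typing import Any, Dict, List


def unsynced_conditions(resource: Dict[str, Any]) -> List[str]:
    """Return the types of all status conditions whose status is not "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    return [c.get("type", "<unknown>") for c in conditions if c.get("status") != "True"]


def is_ready(resource: Dict[str, Any]) -> bool:
    """True only if at least one condition exists and all of them are "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    # No conditions at all means the controller has not reported anything yet.
    return bool(conditions) and not unsynced_conditions(resource)


# Conditions as reported in this issue: both stay "False".
stuck = {
    "status": {
        "conditions": [
            {"type": "ACK.LateInitialized", "status": "False"},
            {"type": "ACK.ResourceSynced", "status": "False"},
        ]
    }
}
print(unsynced_conditions(stuck))
print(is_ready(stuck))
```

For the resource in this report, such a check keeps listing both ACK.LateInitialized and ACK.ResourceSynced, which is why a pipeline polling on it never completes.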
Environment
- Kubernetes version: 1.26
- Using EKS (yes/no), if so version? yes 1.26
- AWS service targeted : IAM
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: aws-load-balancer-webhook
  namespace: kube-system
spec:
  assumeRolePolicyDocument: |
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::REDUCTED:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/REDUCTED"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "oidc.eks.us-east-2.amazonaws.com/id/REDUCTED:aud": "sts.amazonaws.com",
              "oidc.eks.us-east-2.amazonaws.com/id/REDUCTED:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller"
            }
          }
        }
      ]
    }
  inlinePolicies:
    eks-int-REDUCTED-aws-load-balancer-webhook: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*",
            "Condition": {
              "StringEquals": {
                "iam:AWSServiceName": "elasticloadbalancing.amazonaws.com"
              }
            }
          },
          {
            "Effect": "Allow",
            "Action": [
              "ec2:Describe*",
              "ec2:*Tags",
              "ec2:GetCoipPoolUsage",
              "ec2:DescribeCoipPools",
              "elasticloadbalancing:*",
              "cognito-idp:DescribeUserPoolClient",
              "acm:ListCertificates",
              "acm:DescribeCertificate",
              "iam:ListServerCertificates",
              "iam:GetServerCertificate",
              "waf-regional:GetWebACL",
              "waf-regional:GetWebACLForResource",
              "waf-regional:AssociateWebACL",
              "waf-regional:DisassociateWebACL",
              "wafv2:GetWebACL",
              "wafv2:GetWebACLForResource",
              "wafv2:AssociateWebACL",
              "wafv2:DisassociateWebACL",
              "shield:GetSubscriptionState",
              "shield:DescribeProtection",
              "shield:CreateProtection",
              "shield:DeleteProtection",
              "ec2:AuthorizeSecurityGroupIngress",
              "ec2:CreateSecurityGroup",
              "ec2:RevokeSecurityGroupIngress"
            ],
            "Resource": "*"
          },
          {
            "Effect": "Allow",
            "Action": [
              "ec2:DeleteSecurityGroup"
            ],
            "Resource": "*",
            "Condition": {
              "Null": {
                "aws:ResourceTag/kubernetes.io/cluster/eks-int-REDUCTED": "false"
              }
            }
          },
          {
            "Effect": "Allow",
            "Action": [
              "ec2:DeleteSecurityGroup"
            ],
            "Resource": "*",
            "Condition": {
              "StringEquals": {
                "ec2:ResourceTag/elbv2.k8s.aws/cluster": "eks-int-REDUCTED"
              }
            }
          }
        ]
      }
  maxSessionDuration: 3600
  name: eks-int-REDUCTED@kube-system_aws-load-balancer-webhook
  path: /
  tags:
  - key: project
    value: eks
  - key: stage
    value: integration
  - key: owner
    value: eks
status:
  conditions:
  - lastTransitionTime: "2024-01-24T11:18:58Z"
    message: Late initialization did not complete, requeuing with delay of 5 seconds
    reason: Delayed Late Initialization
    status: "False"
    type: ACK.LateInitialized
  - lastTransitionTime: "2024-01-24T11:18:58Z"
    status: "False"
    type: ACK.ResourceSynced
FYI @a-hilaly
I've seen this too. Weirdly, adding a description solved it for me.
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  creationTimestamp: "2024-01-24T12:22:54Z"
  finalizers:
  - finalizers.iam.services.k8s.aws/Role
  generation: 3
  name: test
  namespace: registry
  resourceVersion: "1150159403"
  uid: 48221f29-829f-48a3-9aa6-ca091c9eedb8
spec:
  assumeRolePolicyDocument: |-
    {
      redacted
    }
  inlinePolicies:
    admin: |-
      {
        redacted
      }
  maxSessionDuration: 3600
  name: test-role-create
  path: /
status:
  ackResourceMetadata:
    arn: arn:aws:iam::111111111111:role/test-role-create
    ownerAccountID: "111111111111"
    region: eu-west-1
  conditions:
  - lastTransitionTime: "2024-01-24T13:06:45Z"
    message: Late initialization did not complete, requeuing with delay of 5 seconds
    reason: Delayed Late Initialization
    status: "False"
    type: ACK.LateInitialized
  - lastTransitionTime: "2024-01-24T13:06:45Z"
    status: "False"
    type: ACK.ResourceSynced
  createDate: "2024-01-24T12:22:55Z"
  roleID: redacted
  roleLastUsed: {}
kubectl patch role.iam.services.k8s.aws test -p '{"spec":{"description":"test"}}' --type=merge
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  creationTimestamp: "2024-01-24T12:22:54Z"
  finalizers:
  - finalizers.iam.services.k8s.aws/Role
  generation: 4
  name: test
  namespace: registry
  resourceVersion: "1150161841"
  uid: 48221f29-829f-48a3-9aa6-ca091c9eedb8
spec:
  assumeRolePolicyDocument: |-
    {
      redacted
    }
  description: test
  inlinePolicies:
    admin: |-
      {
        redacted
      }
  maxSessionDuration: 3600
  name: test-role-create
  path: /
status:
  ackResourceMetadata:
    arn: arn:aws:iam::111111111111:role/test-role-create
    ownerAccountID: "111111111111"
    region: eu-west-1
  conditions:
  - lastTransitionTime: "2024-01-24T13:08:37Z"
    message: Late initialization successful
    reason: Late initialization successful
    status: "True"
    type: ACK.LateInitialized
  - lastTransitionTime: "2024-01-24T13:08:37Z"
    message: Resource synced successfully
    reason: ""
    status: "True"
    type: ACK.ResourceSynced
  createDate: "2024-01-24T12:22:55Z"
  roleID: redacted
  roleLastUsed: {}
Maybe it's a mismatch between the nil value in the manifest vs an empty value when retrieved?
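To make that hypothesis concrete, here is a tiny illustration of how an unset (nil) desired value and an empty observed value can compare as different on every pass. It is only a sketch of the guess above, not the controller's actual comparison logic (the controller is written in Go and its code is not shown in this thread); `naive_equal` and `normalized_equal` are hypothetical names.

```python
from typing import Optional


def naive_equal(desired: Optional[str], observed: Optional[str]) -> bool:
    """Strict comparison: an unset value (None) never equals an empty string."""
    return desired == observed


def normalized_equal(desired: Optional[str], observed: Optional[str]) -> bool:
    """Treat an unset value and an empty string as the same thing."""
    return (desired or "") == (observed or "")


desired_description = None   # field omitted in the manifest
observed_description = ""    # assumed: the API hands back an empty value

print(naive_equal(desired_description, observed_description))
print(normalized_equal(desired_description, observed_description))
```

Under that assumption, a strict comparison would report a difference on every reconcile, while a normalized one would not. Setting the description explicitly sidesteps the question because both sides then hold the same non-empty string.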
@universam1 as @matt-simons mentioned, setting the description to any non-nil string should resolve the issue. This is an unfortunate, weird behaviour of the IAM API. We can definitely hack something into the code-gen and fix the behaviour on the ACK side.
Thank you @matt-simons @a-hilaly for that trick - I would never guess that! 😎
TBH I think we need a workaround at least, we might not be able to train all devs to be aware of this hack.
I'm iterating on a few controllers this week and next, and I'll make sure to include a fix for this.
@universam1 Perhaps you could amend the CRD to add a defaulting value for this field?
...
description:
  default: ""
  description: A description of the role.
  type: string
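For teams that cannot change the CRD, the same idea could be applied when manifests are generated, before they reach the cluster. The following is a minimal sketch, assuming manifests are available as parsed dicts; the helper name `with_description` and the fallback text are hypothetical. It uses a non-empty fallback because the thread only confirms that a non-nil description such as "test" works.

```python
import copy
from typing import Any, Dict


def with_description(manifest: Dict[str, Any],
                     fallback: str = "managed by ACK") -> Dict[str, Any]:
    """Return a copy of an ACK IAM Role manifest with spec.description set.

    Manifests of other kinds, or Roles that already carry a description,
    are returned unchanged (as a copy).
    """
    patched = copy.deepcopy(manifest)
    is_ack_role = (
        patched.get("kind") == "Role"
        and str(patched.get("apiVersion", "")).startswith("iam.services.k8s.aws/")
    )
    if not is_ack_role:
        return patched
    spec = patched.setdefault("spec", {})
    if spec.get("description") is None:
        # Hypothetical fallback text; any non-nil string reportedly works.
        spec["description"] = fallback
    return patched


role = {
    "apiVersion": "iam.services.k8s.aws/v1alpha1",
    "kind": "Role",
    "metadata": {"name": "test", "namespace": "registry"},
    "spec": {"name": "test-role-create", "path": "/"},
}
print(with_description(role)["spec"]["description"])
```

Running every generated manifest through such a step means individual developers do not need to know about the workaround.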
This is now fixed in iam-controller v1.3.6. The controller now correctly handles the Description field for Roles and Policies, preventing the infinite requeue caused by a missing Description field in Create calls. cc @universam1 @matt-simons
Thank you @a-hilaly for the effort!