bug: IAM Controller keeps re-reconciling Role

Open universam1 opened this issue 1 year ago • 6 comments

Describe the bug

Similar to #1939, but this issue occurs under all conditions, even when inlinePolicies is set.

This is critical for our CI/CD: we use the status of the CR as the success criterion, and it never succeeds!

Steps to reproduce

  • create a kind: Role
  • enable debug log
  • monitor the status.conditions[*].status of the CR which will stay False forever

Expected outcome

The status field eventually turns True.
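For anyone gating CI on this, here is a minimal Go sketch of the kind of check described above. It assumes the CR is fetched as JSON (for example with `kubectl get ... -o json`) and treats the resource as ready only when the `ACK.ResourceSynced` condition has status `"True"`. The type and function names (`resource`, `condition`, `isSynced`) are made up for illustration and are not part of ACK.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors one entry of status.conditions on the CR.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// resource holds only the part of the CR that this check reads.
type resource struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// isSynced reports whether the ACK.ResourceSynced condition is present
// with status "True". A missing condition counts as not synced.
func isSynced(raw []byte) (bool, error) {
	var r resource
	if err := json.Unmarshal(raw, &r); err != nil {
		return false, err
	}
	for _, c := range r.Status.Conditions {
		if c.Type == "ACK.ResourceSynced" {
			return c.Status == "True", nil
		}
	}
	return false, nil
}

func main() {
	// Conditions as they appear in the report: both stay "False".
	raw := []byte(`{"status":{"conditions":[
		{"type":"ACK.LateInitialized","status":"False"},
		{"type":"ACK.ResourceSynced","status":"False"}]}}`)
	ok, err := isSynced(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // false
}
```

With the bug present, a poll loop around a check like this never returns true, which is why the pipeline never passes.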

Environment

  • Kubernetes version: 1.26
  • Using EKS (yes/no), if so version? yes 1.26
  • AWS service targeted: IAM

apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: aws-load-balancer-webhook
  namespace: kube-system
spec:
  assumeRolePolicyDocument: |
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": "arn:aws:iam::REDUCTED:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/REDUCTED"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        "oidc.eks.us-east-2.amazonaws.com/id/REDUCTED:aud": "sts.amazonaws.com",
                        "oidc.eks.us-east-2.amazonaws.com/id/REDUCTED:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller"
                    }
                }
            }
        ]
    }
  inlinePolicies:
    eks-int-REDUCTED-aws-load-balancer-webhook: |
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "iam:CreateServiceLinkedRole"
                  ],
                  "Resource": "*",
                  "Condition": {
                      "StringEquals": {
                          "iam:AWSServiceName": "elasticloadbalancing.amazonaws.com"
                      }
                  }
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "ec2:Describe*",
                      "ec2:*Tags",
                      "ec2:GetCoipPoolUsage",
                      "ec2:DescribeCoipPools",
                      "elasticloadbalancing:*",
                      "cognito-idp:DescribeUserPoolClient",
                      "acm:ListCertificates",
                      "acm:DescribeCertificate",
                      "iam:ListServerCertificates",
                      "iam:GetServerCertificate",
                      "waf-regional:GetWebACL",
                      "waf-regional:GetWebACLForResource",
                      "waf-regional:AssociateWebACL",
                      "waf-regional:DisassociateWebACL",
                      "wafv2:GetWebACL",
                      "wafv2:GetWebACLForResource",
                      "wafv2:AssociateWebACL",
                      "wafv2:DisassociateWebACL",
                      "shield:GetSubscriptionState",
                      "shield:DescribeProtection",
                      "shield:CreateProtection",
                      "shield:DeleteProtection",
                      "ec2:AuthorizeSecurityGroupIngress",
                      "ec2:CreateSecurityGroup",
                      "ec2:RevokeSecurityGroupIngress"
                  ],
                  "Resource": "*"
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "ec2:DeleteSecurityGroup"
                  ],
                  "Resource": "*",
                  "Condition": {
                      "Null": {
                          "aws:ResourceTag/kubernetes.io/cluster/eks-int-REDUCTED": "false"
                      }
                  }
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "ec2:DeleteSecurityGroup"
                  ],
                  "Resource": "*",
                  "Condition": {
                      "StringEquals": {
                          "ec2:ResourceTag/elbv2.k8s.aws/cluster": "eks-int-REDUCTED"
                      }
                  }
              }
          ]
      }
  maxSessionDuration: 3600
  name: eks-int-REDUCTED@kube-system_aws-load-balancer-webhook
  path: /
  tags:
  - key: project
    value: eks
  - key: stage
    value: integration
  - key: owner
    value: eks
status:
  conditions:
  - lastTransitionTime: "2024-01-24T11:18:58Z"
    message: Late initialization did not complete, requeuing with delay of 5 seconds
    reason: Delayed Late Initialization
    status: "False"
    type: ACK.LateInitialized
  - lastTransitionTime: "2024-01-24T11:18:58Z"
    status: "False"
    type: ACK.ResourceSynced

universam1 avatar Jan 24 '24 11:01 universam1

FYI @a-hilaly

universam1 avatar Jan 24 '24 12:01 universam1

I've seen this too; weirdly, adding a description solved it for me.

apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  creationTimestamp: "2024-01-24T12:22:54Z"
  finalizers:
  - finalizers.iam.services.k8s.aws/Role
  generation: 3
  name: test
  namespace: registry
  resourceVersion: "1150159403"
  uid: 48221f29-829f-48a3-9aa6-ca091c9eedb8
spec:
  assumeRolePolicyDocument: |-
    {
        redacted
    }
  inlinePolicies:
    admin: |-
      {
          redacted
      }
  maxSessionDuration: 3600
  name: test-role-create
  path: /
status:
  ackResourceMetadata:
    arn: arn:aws:iam::111111111111:role/test-role-create
    ownerAccountID: "111111111111"
    region: eu-west-1
  conditions:
  - lastTransitionTime: "2024-01-24T13:06:45Z"
    message: Late initialization did not complete, requeuing with delay of 5 seconds
    reason: Delayed Late Initialization
    status: "False"
    type: ACK.LateInitialized
  - lastTransitionTime: "2024-01-24T13:06:45Z"
    status: "False"
    type: ACK.ResourceSynced
  createDate: "2024-01-24T12:22:55Z"
  roleID: redacted
  roleLastUsed: {}

kubectl patch role.iam.services.k8s.aws test -p '{"spec":{"description":"test"}}' --type=merge

apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  creationTimestamp: "2024-01-24T12:22:54Z"
  finalizers:
  - finalizers.iam.services.k8s.aws/Role
  generation: 4
  name: test
  namespace: registry
  resourceVersion: "1150161841"
  uid: 48221f29-829f-48a3-9aa6-ca091c9eedb8
spec:
  assumeRolePolicyDocument: |-
    {
        redacted
    }
  description: test
  inlinePolicies:
    admin: |-
      {
          redacted
      }
  maxSessionDuration: 3600
  name: test-role-create
  path: /
status:
  ackResourceMetadata:
    arn: arn:aws:iam::111111111111:role/test-role-create
    ownerAccountID: "111111111111"
    region: eu-west-1
  conditions:
  - lastTransitionTime: "2024-01-24T13:08:37Z"
    message: Late initialization successful
    reason: Late initialization successful
    status: "True"
    type: ACK.LateInitialized
  - lastTransitionTime: "2024-01-24T13:08:37Z"
    message: Resource synced successfully
    reason: ""
    status: "True"
    type: ACK.ResourceSynced
  createDate: "2024-01-24T12:22:55Z"
  roleID: redacted
  roleLastUsed: {}

Maybe it's a mismatch between the nil value in the manifest vs an empty value when retrieved?
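To illustrate that hypothesis, here is a small Go sketch. It is not the ACK comparison code; `strictEqual`, `lenientEqual`, and `deref` are hypothetical helpers. It assumes the desired description is a nil `*string` (field omitted from the manifest) while the observed value is a pointer to an empty string, and shows how a strict comparison would report a difference on every reconcile.

```go
package main

import "fmt"

// deref returns the pointed-to string, or "" for a nil pointer.
func deref(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

// strictEqual treats nil and a pointer to "" as different values.
func strictEqual(a, b *string) bool {
	if a == nil || b == nil {
		return a == b // equal only when both are nil
	}
	return *a == *b
}

// lenientEqual treats nil and a pointer to "" as the same value.
func lenientEqual(a, b *string) bool {
	return deref(a) == deref(b)
}

func main() {
	var desired *string // description omitted from the manifest
	empty := ""
	observed := &empty // assumption: the read path yields an empty string

	fmt.Println(strictEqual(desired, observed))  // false
	fmt.Println(lenientEqual(desired, observed)) // true
}
```

If something like the strict comparison were in play, setting any description in the spec would make both sides non-nil and equal, which would match the workaround above.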

matt-simons avatar Jan 24 '24 13:01 matt-simons

@universam1 as @matt-simons mentioned, setting the description to any non-nil string should resolve the issue. This is an unfortunate, weird behaviour of the IAM API. We can definitely hack something into the code-gen and fix the behaviour on the ACK side.

a-hilaly avatar Jan 24 '24 15:01 a-hilaly

Thank you @matt-simons @a-hilaly for that trick - I would never have guessed that! 😎

TBH I think we need a workaround at least, we might not be able to train all devs to be aware of this hack.

universam1 avatar Jan 24 '24 16:01 universam1

I'm iterating on a few controllers this week and next; I'll make sure to include a fix for this.

a-hilaly avatar Jan 24 '24 16:01 a-hilaly

@universam1 Perhaps you could amend the CRD to add a defaulting value for this field?

...
              description:
                default: ""
                description: A description of the role.
                type: string

matt-simons avatar Jan 24 '24 17:01 matt-simons

This is now fixed in iam-controller v1.3.6: the controller now correctly handles the Description field for Roles and Policies, preventing the infinite requeue caused by a missing Description field in Create calls. cc @universam1 @matt-simons

a-hilaly avatar Mar 12 '24 21:03 a-hilaly

Thank you @a-hilaly for the effort!

universam1 avatar Mar 13 '24 07:03 universam1