aws-load-balancer-controller

LB controller spontaneously loses permission to add tags

Open cpetestewart opened this issue 9 months ago • 21 comments

Describe the bug

Recently, out of the blue, we started getting an error from the load balancer controller in our EKS clusters across several accounts. The specific error was that it did not have the elasticloadbalancing:AddTags permission. Nothing had changed on our side. We had neither upgraded the controller nor changed the IAM role.

We traced the error to this clause in the IAM permissions:

        {
            "Effect": "Allow",
            "Action": [
                "elasticloadbalancing:AddTags",
                "elasticloadbalancing:RemoveTags"
            ],
            "Resource": [
                "arn:aws:elasticloadbalancing:*:*:targetgroup/*/*",
                "arn:aws:elasticloadbalancing:*:*:loadbalancer/net/*/*",
                "arn:aws:elasticloadbalancing:*:*:loadbalancer/app/*/*"
            ],
            "Condition": {
                "Null": {
                    "aws:RequestTag/elbv2.k8s.aws/cluster": "true",
                    "aws:ResourceTag/elbv2.k8s.aws/cluster": "false"
                }
            }
        },

The only thing that fixed this was removing the "Condition" clause. After that, the controller operated normally.

This may not be an issue with the LB controller, but before I go to AWS support with this, does anyone have any idea what is causing this? Note that this statement is exactly what is currently in the LB controller repo. We have not changed it since it was first installed.

Steps to reproduce

We're not sure. As indicated above, this started happening out of the blue.

Expected outcome

I expect that if I make zero changes to the cluster, the controller deployment, and the IAM role, everything will continue to function as before.

Environment

  • AWS Load Balancer controller version: 2.4.5
  • Kubernetes version: 1.23
  • Using EKS (yes/no), if so version? yes, 1.23

Additional Context:

cpetestewart avatar Sep 12 '23 14:09 cpetestewart

Pretty sure we ran into this today; still reviewing.

No changes to the cluster, permissions, etc., and suddenly ingress creation fails with a permissions error. It last worked about a week ago.

Somewhat assuming AWS-side changes.

mnort avatar Sep 12 '23 16:09 mnort

We also ran into this yesterday, I assume it is related to this issue https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2692

Is that condition block blocking the controller from adding the tags that the condition requires?

KlausVii avatar Sep 12 '23 17:09 KlausVii

Same issue here with an (albeit EOL) cluster v1.22. No changes whatsoever to infra, won't work anymore unless replacing the condition block as shown in #2692 .

ghost avatar Sep 13 '23 13:09 ghost

We're hitting this on 2.4.4 on EKS 1.21 since September 8th, 19:00:00 UTC-0.

EDIT: we're hitting this in some 1.21 clusters, not all of them.

LCaparelli avatar Sep 13 '23 14:09 LCaparelli

Ran into this using the Terraform module terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks at version 5.2.0. Upgrading to 5.30.0, with the correct IAM policy, resolved it.

elebiodaslingshot avatar Sep 13 '23 15:09 elebiodaslingshot

Hi @cpetestewart, @elebiodaslingshot, @LCaparelli, @MichielVanDerWinden-inQdo, @KlausVii, @mnort, thanks for reaching out. This issue is related to a recent change in the AWS ELB API: from 8/30/2023, a 'Create*' API call fails and returns an error if the caller has no access to elasticloadbalancing:AddTags. The change is not on the AWS Load Balancer Controller side.

Our IAM template has included a fix for this since v2.4.7. Can you please check that your IAM policy is updated with this block: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/install/iam_policy.json#L202. More info is in our release notes: https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.4.7. Related issue: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2692

oliviassss avatar Sep 13 '23 17:09 oliviassss

For context: this change was made to the ELB Create* APIs to add an additional layer of security, so that API callers such as the AWS Load Balancer Controller must have explicit access to add tags in their Identity and Access Management (IAM) policy [1]. Previously, access to attach tags was implicitly granted with access to the Create* APIs.

[1] https://docs.aws.amazon.com/IAM/latest/UserGuide/access_tags.html

rtripat avatar Sep 13 '23 18:09 rtripat
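
For anyone wondering why the original statement stops matching once tagging is authorized at create time, the following is a minimal Python sketch of how the `Null` condition operator behaves. It is a simplified model, not the IAM policy engine: the function name `null_condition_matches`, the cluster name `my-cluster`, and the two request contexts are hypothetical, and it assumes the resource being created carries no resource tag yet when the request is authorized.

```python
def null_condition_matches(condition: dict, context: dict) -> bool:
    """Simplified model of the IAM Null operator.

    "true" requires the key to be absent from the request context,
    "false" requires it to be present. All keys must match.
    """
    for key, expected in condition.items():
        key_is_absent = context.get(key) is None
        if key_is_absent != (expected == "true"):
            return False
    return True


# The Null block from the statement quoted in the issue description.
ORIGINAL_NULL_BLOCK = {
    "aws:RequestTag/elbv2.k8s.aws/cluster": "true",
    "aws:ResourceTag/elbv2.k8s.aws/cluster": "false",
}

# Hypothetical context: tagging an existing, controller-owned load balancer.
# The request does not set the cluster tag; the resource already has it.
tag_existing = {"aws:ResourceTag/elbv2.k8s.aws/cluster": "my-cluster"}

# Hypothetical context: tag-on-create. The cluster tag is in the request,
# and the resource (assumed) has no tags yet.
tag_on_create = {"aws:RequestTag/elbv2.k8s.aws/cluster": "my-cluster"}

print(null_condition_matches(ORIGINAL_NULL_BLOCK, tag_existing))
print(null_condition_matches(ORIGINAL_NULL_BLOCK, tag_on_create))
```

Under these assumptions the first call matches and the second does not, which is consistent with the controller working until Create* calls began requiring explicit AddTags access.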

This comment points to the solution, and the doc was updated here. I guess we just need to keep a close eye on the docs and releases for changes 😸

I added that statement to the controller policy and that fixed the issue for me.

andresvia avatar Sep 14 '23 00:09 andresvia

This issue also appears when the AWS resources the controller is trying to reconcile no longer exist. It's a confusing error because it sends you on an IAM policy goose chase in this case.

blakebarnett avatar Sep 14 '23 18:09 blakebarnett

We had 0 infra changes and bumped into this issue 2 days ago. Our services were impacted by this unexpected change.

This issue is related to a recent change in the AWS ELB api call - from 8/30/2023

@oliviassss Do you have a reference link for the above, so that I can explain the change to my team? Thank you.

To add one more data point: we deployed on 09/06/2023 without any issues, which doesn't make sense to me if the API changed on 8/30/2023.

imZack avatar Sep 15 '23 07:09 imZack

@imZack, since the AWS Load Balancer Controller calls the createLoadBalancer and createTargetGroup APIs, we are affected whenever those APIs change, even if there is 0 change on our side. Unfortunately, I don't think there is a public link regarding this security change, but I have communicated with our ELB team internally, and they should be able to clarify this issue soon.

oliviassss avatar Sep 15 '23 17:09 oliviassss

@oliviassss thank you for your response. Please help escalate this concern to the ELB team. As you can imagine, services running without their load balancers lead to lots of problems.

imZack avatar Sep 16 '23 06:09 imZack

Hello @imZack, I am L. Felipe from the Elastic Load Balancing team. As mentioned previously in this thread, this change is expected, and the final part of it occurred during the time period mentioned (September 7 - 12, 2023). This update requires explicit permissions for ELB APIs that include the ability to create tags when creating resources, e.g., tag-on-create APIs. This affects all APIs that can create or manipulate tags; CreateLoadBalancer, CreateTargetGroup, CreateListener and CreateRule.

We made this change in June, 2023. As part of the rollout, we identified customers that would potentially be affected by the change, and notified them via the AWS Personal Health Dashboard (PHD). These customers were given additional time to update their systems before the change would be applied to their accounts. By September 12, 2023, all calls modifying or creating tags on ELBs were updated to require explicit permissions. Although we did notify customers which we identified as impacted, we did not include customers who were not using the tag-on-create APIs.

AWS takes any change that could break or impact customer workloads seriously, and we try to minimize impact to customers whenever such a change is required. Security is one area we will consider such changes, and this change increases your security by preventing unauthorized use of tags, and bringing tag permissions via ELB APIs in line with tag use across all AWS services. We apologize for any confusion or impact this may have had to you or your applications.

luisfelipess avatar Sep 26 '23 09:09 luisfelipess

Thank you @luisfelipess for the further explanation. We do appreciate the effort that you and your team put into security. I suggest AWS document these changes somewhere public instead of only notifying potentially affected users via the PHD, since there are tons of guides, blogs, and notes that still show the outdated usage.

imZack avatar Sep 28 '23 12:09 imZack

@oliviassss The new changes work great, thanks.

@luisfelipess Thanks for the info. Please be advised that my company has 6 accounts that this change affected and not one got notified of the change.

cpetestewart avatar Sep 28 '23 17:09 cpetestewart

Hey @luisfelipess,

Thanks for chiming in on this. However, as @cpetestewart mentioned, I also have 20+ AWS accounts that didn't get any notification whatsoever about the mentioned change. I'm happy to get in touch with the support team and provide some account IDs and usage patterns, as it seems that the approach you're using to identify impacted customers is not entirely accurate.

danvaida avatar Sep 29 '23 08:09 danvaida

Hello, I have 3 EKS clusters running k8s 1.26.8, and all are running the latest AWS LB controller app version 2.6.1, installed via blueprints addons. One cluster hits the add-tag error when creating a new ingress object.

Upon reviewing the IAM policy attached to the role the Load Balancer controller is assuming, it does not have the statement @andresvia posted a link to a few posts up. Ideally I don't manually add the policy statement as all our env's are automated and will overwrite drift. Should the app version 2.6.1 AWS Load Balancer include this fix?

kevinchiu-mlse avatar Oct 02 '23 18:10 kevinchiu-mlse

@kevinchiu-mlse, hi, we have the updated IAM policy template since v2.4.7. However, when you upgrade the AWS LBC version, the IAM policy does not update automatically, since it's not a managed policy, and we rely on users to update it. Please see the release note: https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.4.7

oliviassss avatar Oct 02 '23 18:10 oliviassss
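
Since the IAM policy has to be updated by hand, a quick check of an existing policy document can help. The following is a rough Python sketch, not an official tool: the helper name `grants_tag_on_create` and both sample policies are illustrative, and the `elasticloadbalancing:CreateAction` condition is assumed from the block in the linked iam_policy.json, so verify it against that file. The heuristic ignores wildcard actions, `Deny` statements, and resource scoping.

```python
def _as_list(value):
    """IAM allows a single string/object where a list is expected."""
    return value if isinstance(value, list) else [value]


def grants_tag_on_create(policy: dict) -> bool:
    """Heuristic: is there an Allow statement for AddTags that is either
    unconditional or scoped with elasticloadbalancing:CreateAction?"""
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if "elasticloadbalancing:AddTags" not in _as_list(stmt.get("Action", [])):
            continue
        condition = stmt.get("Condition")
        if not condition:
            return True  # unconditional AddTags, as in the workaround above
        if "elasticloadbalancing:CreateAction" in condition.get("StringEquals", {}):
            return True
    return False


# The statement quoted in the issue description (resources omitted for brevity).
OLD_STATEMENT = {
    "Effect": "Allow",
    "Action": ["elasticloadbalancing:AddTags", "elasticloadbalancing:RemoveTags"],
    "Condition": {
        "Null": {
            "aws:RequestTag/elbv2.k8s.aws/cluster": "true",
            "aws:ResourceTag/elbv2.k8s.aws/cluster": "false",
        }
    },
}

# Assumed shape of the statement added to the template in v2.4.7.
NEW_STATEMENT = {
    "Effect": "Allow",
    "Action": ["elasticloadbalancing:AddTags"],
    "Condition": {
        "StringEquals": {
            "elasticloadbalancing:CreateAction": [
                "CreateTargetGroup",
                "CreateLoadBalancer",
            ]
        },
        "Null": {"aws:RequestTag/elbv2.k8s.aws/cluster": "false"},
    },
}

old_policy = {"Version": "2012-10-17", "Statement": [OLD_STATEMENT]}
new_policy = {"Version": "2012-10-17", "Statement": [OLD_STATEMENT, NEW_STATEMENT]}

print(grants_tag_on_create(old_policy))
print(grants_tag_on_create(new_policy))
```

To run it against a real role, load the policy JSON exported from IAM with `json.load` and pass the resulting dict to the helper.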

Thanks @oliviassss. I see that in EKS Blueprints Addons 5.0 the policy is updated and managed in the load balancer controller module; however, on clusters still running 4.32.1 or older, the policy is outdated. Worst case, I can attach a custom policy to the LBC role.

kevinchiu-mlse avatar Oct 02 '23 18:10 kevinchiu-mlse

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 29 '24 14:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 28 '24 15:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 29 '24 15:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 29 '24 15:03 k8s-ci-robot