aws-load-balancer-controller
RevokeSecurityGroupIngress responsible for SG inbound rules getting deleted
We're running the AWS Load Balancer Controller (using an NLB for an FTP service with a few passive ports and one active port). We're encountering an issue where Security Group inbound rules are deleted (the SG rules carrying the elbv2.k8s.aws/targetGroupBinding=shared description). Looking at CloudTrail, we see that ec2:RevokeSecurityGroupIngress calls made by our IAM role are responsible. We removed that permission from the IAM role, but then, when the pods restarted, the health checks became unhealthy, and in the aws-lb-controller logs we see
{"level":"error","ts":1648057864.807936,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-ferefact-ftplet-090bfc06bc","namespace":"fe-refactor-ftplet","error":"UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: "}
I'm using the IAM policy from https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/install/iam_policy.json
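That policy does include both ec2:AuthorizeSecurityGroupIngress and ec2:RevokeSecurityGroupIngress, which the controller needs in order to add and remove the shared rules. One way to confirm that the role attached to the controller still carries both actions is IAM policy simulation; the role ARN below is a placeholder:

# Simulate the two SG-rule actions against the controller's IAM role
aws iam simulate-principal-policy \
  --policy-source-arn <CONTROLLER_ROLE_ARN> \
  --action-names ec2:AuthorizeSecurityGroupIngress ec2:RevokeSecurityGroupIngress \
  --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
  --output table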
Restarting the aws-lb-controller pods in kube-system fixes it.
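For reference, a minimal restart of the controller looks like this, assuming the default Helm release name in kube-system:

# Restart the controller pods so they re-run reconciliation
kubectl -n kube-system rollout restart deployment aws-load-balancer-controller
# Follow the controller logs for further TargetGroupBinding reconcile errors
kubectl -n kube-system logs -f deployment/aws-load-balancer-controller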
Expected outcome: SG inbound rules shouldn't be deleted.
- AWS Load Balancer controller version = v2.3.1
- Kubernetes version = v1.21.5
- Using EKS (yes/no), if so version? yes
- Helm Chart version = 1.3.2
@zeevMetz, the aws-lb-controller requires the IAM permission to manage security group rules for traffic. If you remove the IAM permission, the controller will continue to log error messages; restarting the controller will not fix it.
The controller removes the security group ingress rules when it determines the rules are no longer necessary for your service. Did you delete the services?
@kishorj Thanks for your response.
After adding ec2:RevokeSecurityGroupIngress back to the IAM role, at some point the health check becomes unhealthy (the service itself is up). Looking at the SG inbound rules of that node, I can see that the rules added by the aws-lb-controller are missing, which is what causes the health check to fail.
Any idea what's causing that?
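A minimal sketch of checking the failing targets directly while the rules are missing; the target group ARN is a placeholder:

# Inspect target health for the NLB target group behind the FTP service
aws elbv2 describe-target-health --target-group-arn <TARGET_GROUP_ARN> \
  --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
  --output table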
The controller automatically adds the necessary rules for traffic access and health checks. You could force a reconcile by either modifying the service or restarting the controller, and see whether the controller configures the SG rules. The SG rules with the description elbv2.k8s.aws/targetGroupBinding=shared should be managed only by the controller.
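A hedged sketch of both suggestions, touching the Service to trigger a reconcile and then listing the controller-managed ingress rules; the service name, namespace, and security group ID are placeholders:

# Touch the Service so the controller re-evaluates it (any metadata change works for this)
kubectl -n <SERVICE_NAMESPACE> annotate service <FTP_SERVICE_NAME> reconcile-touch="$(date +%s)" --overwrite
# List the ingress rules the controller manages on the worker-node security group
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=<NODE_SG_ID> \
  --query 'SecurityGroupRules[?Description==`elbv2.k8s.aws/targetGroupBinding=shared`]'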
@kishorj Yes, I do see the elbv2.k8s.aws/targetGroupBinding=shared description with the correct port on our SG inbound rules.

But as I mentioned earlier, at some point the controller, for some reason, revokes those rules (the service itself is still running) and the health checks obviously go unhealthy. Any thoughts on what might cause those rules to be revoked while the service is still up?
BTW, I upgraded the controller to v2.4.1 (Helm chart v1.4.1) and I still see this issue occur.
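For reference, an upgrade to chart v1.4.1 from the standard eks-charts repo looks roughly like this; the release name, repo alias, and namespace are assumptions:

# Upgrade the controller chart while keeping the existing values
helm repo update
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller \
  --namespace kube-system --version 1.4.1 --reuse-values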
Do you happen to run multiple instances of the controller in the same cluster, or share the security group across multiple clusters in the same VPC?
@kishorj Yes, I'm running 2 replicas of the controller in the kube-system namespace. The security group is shared across all our worker nodes in one cluster.
@zeevMetz When you say 2 instances, do you mean two deployments? We only support one deployment of the controller, which can have 2 or more Pods under leader election.
If you have multiple deployments, the controllers will conflict with each other, since each one reconciles the worker SG rules based on its own set of Ingresses/Services; see the check below.
We have an ongoing feature request to support multiple deployments.
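A quick way to rule out the multiple-deployment case, assuming the standard Helm chart labels:

# There should be exactly one deployment; multiple replicas of that one deployment are fine thanks to leader election
kubectl get deployments -A -l app.kubernetes.io/name=aws-load-balancer-controller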
@M00nF1sh Sorry for misleading you, I meant 2 replicas of one deployment (I've edited the thread).
@zeevMetz, would you be able to email the controller logs to k8s-alb-controller-triage AT amazon.com? Also check your CloudTrail for the RevokeSecurityGroupIngress calls to see the user agent of the component making them.
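A hedged sketch of that CloudTrail check, pulling the user agent out of recent RevokeSecurityGroupIngress events (requires the AWS CLI and jq):

# List recent RevokeSecurityGroupIngress events with timestamp, user agent, and caller identity
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RevokeSecurityGroupIngress \
  --max-results 20 --output json \
  | jq -r '.Events[].CloudTrailEvent | fromjson | [.eventTime, .userAgent, .userIdentity.arn // ""] | @tsv'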
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm seeing this exact same issue using v2.6.0 and v2.6.1.