
RevokeSecurityGroupIngress responsible for SG inbound rules getting deleted

Open zeevMetz opened this issue 3 years ago • 10 comments

We're running our AWS LB Controller (using an NLB for an FTP service with a few passive ports and one active port). We've hit an issue where Security Group inbound rules are deleted (the SG rules with the elbv2.k8s.aws/targetGroupBinding=shared description). Looking at CloudTrail, we see that ec2:RevokeSecurityGroupIngress calls made with our IAM role are responsible for that. We removed that permission from the IAM role, but then, for some reason, when the pods restarted we got unhealthy health checks, and in the aws-lb-controller logs we see:

{"level":"error","ts":1648057864.807936,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-ferefact-ftplet-090bfc06bc","namespace":"fe-refactor-ftplet","error":"UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: "}

I'm using this IAM policy: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/install/iam_policy.json. Restarting the aws-lb-controller pods in kube-system fixes it.
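For reference, a minimal boto3 sketch (the security group ID is a placeholder and pagination is ignored) that lists the inbound rules carrying the shared description, so you can snapshot what the controller has added before and after the rules disappear:

```python
# Sketch only: assumes boto3 is installed and AWS credentials are configured.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.describe_security_group_rules(
    Filters=[{"Name": "group-id", "Values": ["sg-0123456789abcdef0"]}]  # placeholder node SG ID
)

for rule in resp["SecurityGroupRules"]:
    # Inbound rules added by the controller carry this description.
    if not rule["IsEgress"] and rule.get("Description") == "elbv2.k8s.aws/targetGroupBinding=shared":
        print(rule["SecurityGroupRuleId"], rule.get("FromPort"), rule.get("ToPort"), rule.get("CidrIpv4"))
```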

Expected outcome: SG inbound rules shouldn't be deleted.

  • AWS Load Balancer controller version = v2.3.1
  • Kubernetes version = v1.21.5
  • Using EKS (yes/no), if so version? yes
  • Helm Chart version = 1.3.2

zeevMetz avatar Mar 23 '22 21:03 zeevMetz

@zeevMetz, the AWS LB controller requires the IAM permission to manage security group rules for traffic. If you remove the IAM permission, the controller will continue to log error messages; restarting the controller will not fix it.

The controller removes the security group ingress rules when it determines the rules are no longer necessary for your service. Did you delete the services?
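To rule out the service/binding having gone away, here is a small sketch with the kubernetes Python client (assuming the TargetGroupBinding CRD is served at elbv2.k8s.aws/v1beta1, as in recent controller versions) that lists the bindings the controller is reconciling:

```python
# Sketch only: assumes the `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# TargetGroupBindings are the objects whose presence keeps the managed SG rules around.
bindings = api.list_cluster_custom_object(
    group="elbv2.k8s.aws", version="v1beta1", plural="targetgroupbindings"
)
for item in bindings["items"]:
    meta = item["metadata"]
    print(meta["namespace"], meta["name"], item["spec"].get("targetGroupARN"))
```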

kishorj avatar Mar 23 '22 22:03 kishorj

@kishorj Thanks for your response. When adding ec2:RevokeSecurityGroupIngress back to the IAM role, at some point the health check becomes unhealthy (the service itself is up). Looking at the SG inbound rules of that node, I see that the rules added by the aws-lb-controller are missing, which is what makes the health check unhealthy. Any idea what's causing that?

zeevMetz avatar Mar 24 '22 07:03 zeevMetz

The controller automatically adds the necessary rules for traffic access and health checks. You could force a reconcile by either modifying the service or restarting the controller, and see whether the controller configures the SG rules. The SG rules with the description elbv2.k8s.aws/targetGroupBinding=shared should be managed only by the controller.
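If it helps, a minimal sketch of forcing that reconcile programmatically (the deployment name and namespace follow the default Helm install and may differ in your cluster); patching the pod template annotation is equivalent to `kubectl rollout restart`:

```python
# Sketch only: assumes the `kubernetes` Python client and a working kubeconfig.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Bumping this annotation rolls the controller pods, which triggers a full reconcile.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment(
    name="aws-load-balancer-controller",  # assumed default release name
    namespace="kube-system",
    body=patch,
)
```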

kishorj avatar Mar 24 '22 16:03 kishorj

@kishorj Yes, I do see the elbv2.k8s.aws/targetGroupBinding=shared tag with the correct port in our SG inbound rule. [Screenshot: SG inbound rules, Mar 24 2022]

But as I mentioned earlier, at some point the controller, for some reason, revokes those rules (the service itself is still running), and the health checks are obviously unhealthy. Any thoughts on what might cause the rules to be revoked while the service is still up?

BTW, I upgraded the controller to v2.4.1 (Helm chart v1.4.1) and I still see this issue occur.

zeevMetz avatar Mar 24 '22 17:03 zeevMetz

Do you happen to run multiple instances of the controller in the same cluster, or share the security group across multiple clusters in the same VPC?

kishorj avatar Mar 24 '22 20:03 kishorj

@kishorj Yes, I'm running with 2 replicas of the controller in the kube-system namespace. The security group is shared by all our worker nodes in one cluster.

zeevMetz avatar Mar 27 '22 12:03 zeevMetz

@zeevMetz When you say 2 instances, do you mean two Deployments? We only support one Deployment of the controller, which can have 2 or more Pods under leader election.

If you have multiple Deployments, the controllers will conflict with each other, since each one will reconcile the worker SG rules based on its own set of Ingresses/Services.

We have an ongoing feature request to support multiple Deployments.
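A quick way to verify there is only one controller Deployment (with replicas behind leader election) is a sketch like the following; the label selector matches the chart's default labels and is an assumption:

```python
# Sketch only: assumes the `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Expect exactly one Deployment here; more than one suggests conflicting installs.
deployments = apps.list_deployment_for_all_namespaces(
    label_selector="app.kubernetes.io/name=aws-load-balancer-controller"  # assumed default labels
)
for d in deployments.items:
    print(d.metadata.namespace, d.metadata.name, "replicas:", d.spec.replicas)
```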

M00nF1sh avatar Mar 30 '22 22:03 M00nF1sh

@M00nF1sh Sorry for misleading you, I meant 2 replicas and one Deployment (I've edited the thread).

zeevMetz avatar Mar 31 '22 06:03 zeevMetz

@zeevMetz, would you be able to email the controller logs to k8s-alb-controller-triage AT amazon.com? Also check your CloudTrail for the RevokeSecurityGroupIngress calls to see the user-agent of the component making them.
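For the CloudTrail check, a minimal boto3 sketch (assuming CloudTrail is enabled in the region) that prints who issued each recent RevokeSecurityGroupIngress call and with which user agent:

```python
# Sketch only: assumes boto3 and AWS credentials for the cluster's account/region.
import json
import boto3

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "RevokeSecurityGroupIngress"}],
    MaxResults=50,
)
for e in events["Events"]:
    detail = json.loads(e["CloudTrailEvent"])  # the raw event is a JSON string
    print(detail["eventTime"], detail["userIdentity"].get("arn"), detail.get("userAgent"))
```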

kishorj avatar Apr 20 '22 22:04 kishorj

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 19 '22 22:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 18 '22 23:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Sep 17 '22 23:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this: the k8s-triage-robot's /close comment quoted above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 17 '22 23:09 k8s-ci-robot

I'm seeing this exact same issue using 2.6.0 and 2.6.1.

j-bruce avatar Mar 14 '24 22:03 j-bruce