
RevokeSecurityGroupIngress responsible for SG inbound rules getting deleted

Open zeevMetz opened this issue 3 years ago • 10 comments

We're running our AWS LB Controller (using an NLB for an FTP service with a few passive ports and one active port). We've hit an issue where Security Group inbound rules are deleted (the SG rules with the elbv2.k8s.aws/targetGroupBinding=shared description). Looking at CloudTrail, we see that ec2:RevokeSecurityGroupIngress calls made with our IAM role are responsible for that. We removed that permission from the IAM role, but then, for some reason, when the pods restarted we got unhealthy health checks, and in the aws-lb-controller logs we see:

{"level":"error","ts":1648057864.807936,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-ferefact-ftplet-090bfc06bc","namespace":"fe-refactor-ftplet","error":"UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: "}

I'm using this IAM policy: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/install/iam_policy.json. Restarting the aws-lb-controller pods in kube-system fixes it.
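For reference, a minimal boto3 sketch (the security group ID is a placeholder and pagination is ignored) that lists the inbound rules carrying the shared description, so you can snapshot what the controller has added before and after the rules disappear:

```python
# Sketch only: assumes boto3 is installed and AWS credentials are configured.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.describe_security_group_rules(
    Filters=[{"Name": "group-id", "Values": ["sg-0123456789abcdef0"]}]  # placeholder node SG ID
)

for rule in resp["SecurityGroupRules"]:
    # Inbound rules added by the controller carry this description.
    if not rule["IsEgress"] and rule.get("Description") == "elbv2.k8s.aws/targetGroupBinding=shared":
        print(rule["SecurityGroupRuleId"], rule.get("FromPort"), rule.get("ToPort"), rule.get("CidrIpv4"))
```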

Expected outcome: SG inbound rules shouldn't be deleted.

  • AWS Load Balancer controller version = v2.3.1
  • Kubernetes version = v1.21.5
  • Using EKS (yes/no), if so version? yes
  • Helm Chart version = 1.3.2

zeevMetz avatar Mar 23 '22 21:03 zeevMetz

@zeevMetz, the AWS LB controller requires the IAM permission to manage security group rules for traffic. If you remove the IAM permission, the controller will continue to log error messages; restarting the controller will not fix it.

The controller removes the security group ingress rules when it determines the rules are no longer necessary for your service. Did you delete the services?
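To rule out the service/binding having gone away, here is a small sketch with the kubernetes Python client (assuming the TargetGroupBinding CRD is served at elbv2.k8s.aws/v1beta1, as in recent controller versions) that lists the bindings the controller is reconciling:

```python
# Sketch only: assumes the `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# TargetGroupBindings are the objects whose presence keeps the managed SG rules around.
bindings = api.list_cluster_custom_object(
    group="elbv2.k8s.aws", version="v1beta1", plural="targetgroupbindings"
)
for item in bindings["items"]:
    meta = item["metadata"]
    print(meta["namespace"], meta["name"], item["spec"].get("targetGroupARN"))
```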

kishorj avatar Mar 23 '22 22:03 kishorj

@kishorj Thanks for your response. When adding ec2:RevokeSecurityGroupIngress back to the IAM role, at some point the health check becomes unhealthy (the service itself is up). Looking at the SG inbound rules of that node, I see that the rules added by the aws-lb-controller are missing, which is what makes the health check unhealthy. Any idea what's causing that?

zeevMetz avatar Mar 24 '22 07:03 zeevMetz

The controller automatically adds the necessary rules for traffic access and health checks. You could force a reconcile by either modifying the service or restarting the controller, and see whether the controller configures the SG rules. The SG rules with the description elbv2.k8s.aws/targetGroupBinding=shared should be managed only by the controller.
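If it helps, a minimal sketch of forcing that reconcile programmatically (the deployment name and namespace follow the default Helm install and may differ in your cluster); patching the pod template annotation is equivalent to `kubectl rollout restart`:

```python
# Sketch only: assumes the `kubernetes` Python client and a working kubeconfig.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Bumping this annotation rolls the controller pods, which triggers a full reconcile.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment(
    name="aws-load-balancer-controller",  # assumed default release name
    namespace="kube-system",
    body=patch,
)
```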

kishorj avatar Mar 24 '22 16:03 kishorj

@kishorj Yes, I do see the elbv2.k8s.aws/targetGroupBinding=shared tag with the correct port in our SG inbound rule. [Screenshot: SG inbound rules, Mar 24 2022]

But as I mentioned earlier, at some point the controller, for some reason, revokes those rules (the service itself is still running), and the health checks are obviously unhealthy. Any thoughts on what might cause the rules to be revoked while the service is still up?

BTW, I upgraded the controller to v2.4.1 (Helm chart v1.4.1) and I still see this issue occur.

zeevMetz avatar Mar 24 '22 17:03 zeevMetz

Do you happen to run multiple instances of the controller in the same cluster, or share the security group across multiple clusters in the same VPC?

kishorj avatar Mar 24 '22 20:03 kishorj

@kishorj Yes, I'm running with 2 replicas of the controller in the kube-system namespace. The security group is shared by all our worker nodes in one cluster.

zeevMetz avatar Mar 27 '22 12:03 zeevMetz

@zeevMetz When you say 2 instances, do you mean two Deployments? We only support one Deployment of the controller, which can have 2 or more Pods under leader election.

If you have multiple Deployments, the controllers will conflict with each other, since each one will reconcile the worker SG rules based on its own set of Ingresses/Services.

We have an ongoing feature request to support multiple Deployments.
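A quick way to verify there is only one controller Deployment (with replicas behind leader election) is a sketch like the following; the label selector matches the chart's default labels and is an assumption:

```python
# Sketch only: assumes the `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Expect exactly one Deployment here; more than one suggests conflicting installs.
deployments = apps.list_deployment_for_all_namespaces(
    label_selector="app.kubernetes.io/name=aws-load-balancer-controller"  # assumed default labels
)
for d in deployments.items:
    print(d.metadata.namespace, d.metadata.name, "replicas:", d.spec.replicas)
```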

M00nF1sh avatar Mar 30 '22 22:03 M00nF1sh

@M00nF1sh Sorry for misleading you, I meant 2 replicas and one Deployment (I've edited the thread).

zeevMetz avatar Mar 31 '22 06:03 zeevMetz

@zeevMetz, would you be able to email the controller logs to k8s-alb-controller-triage AT amazon.com? Also check your CloudTrail for the RevokeSecurityGroupIngress calls to see the user-agent of the component making them.
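For the CloudTrail check, a minimal boto3 sketch (assuming CloudTrail is enabled in the region) that prints who issued each recent RevokeSecurityGroupIngress call and with which user agent:

```python
# Sketch only: assumes boto3 and AWS credentials for the cluster's account/region.
import json
import boto3

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "RevokeSecurityGroupIngress"}],
    MaxResults=50,
)
for e in events["Events"]:
    detail = json.loads(e["CloudTrailEvent"])  # the raw event is a JSON string
    print(detail["eventTime"], detail["userIdentity"].get("arn"), detail.get("userAgent"))
```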

kishorj avatar Apr 20 '22 22:04 kishorj

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 19 '22 22:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 18 '22 23:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Sep 17 '22 23:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this: the k8s-triage-robot's /close comment quoted above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 17 '22 23:09 k8s-ci-robot

I'm seeing this exact same issue using 2.6.0 and 2.6.1.

j-bruce avatar Mar 14 '24 22:03 j-bruce