cloud-provider-aws icon indicating copy to clipboard operation
cloud-provider-aws copied to clipboard

Fix leak managed/owned security group on Service update with BYO SG

Open mtulio opened this issue 5 months ago • 33 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a leaked security group (SG) when a Service type-loadBalancer (CLB) is updated adding the BYO SG annotation (service.beta.kubernetes.io/aws-load-balancer-security-groups), which replaces all SG added to the Load Balancer without removing rules and deleting it when created by controller.

Which issue(s) this PR fixes:

Fixes #1208

Special notes for your reviewer:

The approach of creating isolated function was used specially to:

  • enhance code maintenance
  • enhance unit tests
  • allow to reuse the logic when NLB with SG is supported

The unit tests and documentation(function) comments have been assisted by Cursor AI(model claude-4-sonet): AIA HAb SeCeNc Hin R v1.0

Does this PR introduce a user-facing change?:

Fixed security group leak when updating Classic Load Balancer services with `service.beta.kubernetes.io/aws-load-balancer-security-groups` annotation. Controller-managed security groups are now properly cleaned up when switching to user-specified security groups.

mtulio avatar Jul 15 '25 17:07 mtulio

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 15 '25 17:07 k8s-ci-robot

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 15 '25 17:07 k8s-ci-robot

/ok-to-test

elmiko avatar Jul 15 '25 20:07 elmiko

/test all

mtulio avatar Jul 15 '25 21:07 mtulio

Fixing doc strings and failed unit tests from previous unexpected behavior:

/test all

mtulio avatar Jul 15 '25 22:07 mtulio

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 16 '25 02:07 k8s-ci-robot

/test pull-cloud-provider-aws-e2e-kubetest2

mtulio avatar Jul 16 '25 03:07 mtulio

/test all

mtulio avatar Jul 16 '25 03:07 mtulio

I can't find connection between failures in pull-cloud-provider-aws-e2e-kubetest2 and existing changes.

I am going to convert to regular PR to ask for reviewers while we observe if this isnt a CI flake.

PTAL? /assign @kmala @elmiko @JoelSpeed

mtulio avatar Jul 16 '25 14:07 mtulio

This PR is ready for review. PTAL?

mtulio avatar Jul 18 '25 19:07 mtulio

/test pull-cloud-provider-aws-e2e-kubetest2-quick

mtulio avatar Jul 23 '25 02:07 mtulio

Experiencing some instability on CI, failed to build image. Retrying: /test pull-cloud-provider-aws-e2e

mtulio avatar Jul 24 '25 20:07 mtulio

/test pull-cloud-provider-aws-e2e

mtulio avatar Jul 25 '25 15:07 mtulio

Rebase and testing e2e that got stuck on downloading dependencies. Waiting for new results.

mtulio avatar Jul 29 '25 20:07 mtulio

I am going to write a new e2e test case covering the update of from managed to BYO SG on CLB to increase confidence to this PR.

mtulio avatar Aug 01 '25 14:08 mtulio

Introduced a new e2e test case to validate the full workflow of proposed fix.

$ ./e2e.test --ginkgo.v  --ginkgo.focus="CLB with managed Security Group" 
  W0801 16:24:41.940826  487078 test_context.go:478] Unable to find in-cluster config, using default host : https://127.0.0.1:6443
  Aug  1 16:24:41.940: INFO: The --provider flag is not set. Continuing as if --provider=skeleton had been used.
Running Suite: AWS Cloud Provider End-to-End Tests - /home/redacted
=========================================================================================
Random Seed: 1754076281 - will randomize all specs

Will run 1 of 9 specs
SS
------------------------------
[cloud-provider-aws-e2e] loadbalancer CLB with managed Security Group must update to BYO Security Group

[...]

 STEP: executing pre-test hook @ 08/01/25 16:24:56.104
  Aug  1 16:24:56.104: INFO: running hook post-service-config patching service annotation with BYO security group
  Aug  1 16:24:56.104: INFO: Getting load balancer security groups with address a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com
  Aug  1 16:25:03.887: INFO: Load balancer a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com has security groups: [sg-037bc641dfde2e5b6]
  Aug  1 16:25:03.887: INFO: Checking if managed SGs are owned by the controller
  Aug  1 16:25:04.147: INFO: Security group "sg-037bc641dfde2e5b6" is managed: true
  Aug  1 16:25:04.147: INFO: Creating BYO SG with name "cloud-provider-aws-5233-lbconfig-test-clb-sg-sg-byo"
  Aug  1 16:25:04.570: INFO: BYO SG "cloud-provider-aws-5233-lbconfig-test-clb-sg-sg-byo" created with ID "sg-0c7a7bb83076403be"
  Aug  1 16:25:04.570: INFO: Authorizing BYO SG "sg-0c7a7bb83076403be" to service ports: [{Name:port-0 Protocol:TCP AppProtocol:<nil> Port:80 TargetPort:{Type:0 IntVal:8080 StrVal:} NodePort:32297}]
  Aug  1 16:25:04.570: INFO: Authorizing security group sg-0c7a7bb83076403be for 1 ports
  Aug  1 16:25:04.570: INFO: Adding rule: protocol=tcp, port=80, source=0.0.0.0/0
  Aug  1 16:25:04.570: INFO: Calling AWS AuthorizeSecurityGroupIngress for security group sg-0c7a7bb83076403be with 1 rules
  Aug  1 16:25:04.893: INFO: Successfully authorized security group sg-0c7a7bb83076403be.
  Aug  1 16:25:04.893: INFO: BYO SG "sg-0c7a7bb83076403be" authorized to service ports
  Aug  1 16:25:04.893: INFO: Verifying security group rules were created for BYO SG "sg-0c7a7bb83076403be"
  Aug  1 16:25:05.141: INFO: Security group sg-0c7a7bb83076403be has 1 ingress rules:
  Aug  1 16:25:05.141: INFO:   Rule 1: protocol=tcp, ports=80-80, sources=[0.0.0.0/0]
  Aug  1 16:25:05.141: INFO: ✓ Found rule for protocol=tcp port=80 in security group sg-0c7a7bb83076403be
  Aug  1 16:25:05.141: INFO: Patching Service "lbconfig-test-clb-sg" with BYO SG "sg-0c7a7bb83076403be"
  Aug  1 16:25:05.298: INFO: Waiting for load balancer to be updated
  Aug  1 16:25:15.307: INFO: Getting load balancer security groups after update
  Aug  1 16:25:15.533: INFO: Load balancer a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com has security groups: [sg-0c7a7bb83076403be]
  Aug  1 16:25:15.533: INFO: Checking if LB is using BYO SG "sg-0c7a7bb83076403be"
  Aug  1 16:25:15.533: INFO: Load balancer a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com has BYO security group "sg-0c7a7bb83076403be"
  Aug  1 16:25:15.533: INFO: Checking if managed SGs were removed
  Aug  1 16:25:15.533: INFO: Checking if managed SG "sg-037bc641dfde2e5b6" is removed by controller
  Aug  1 16:25:15.775: INFO: Managed security group "sg-037bc641dfde2e5b6" removed
  Aug  1 16:25:15.775: INFO: pre-test hook completed

[...]

  STEP: testing HTTP connectivity from external client @ 08/01/25 16:25:15.775
  Aug  1 16:25:15.775: INFO: [TEST] Running external connectivity test to a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com:80
  Aug  1 16:25:15.775: INFO: Poking "http://a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com:80/echo?msg=hello"
  Aug  1 16:25:15.840: INFO: Poke("http://a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com:80/echo?msg=hello"): Get "http://a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com:80/echo?msg=hello": dial tcp: lookup a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com: no such host

[...]

 Aug  1 16:26:16.183: INFO: [TEST] HTTP connectivity test completed successfully
  STEP: cleaning up: converting service to ClusterIP @ 08/01/25 16:26:16.183
  STEP: cleaning up: waiting for load balancer destruction @ 08/01/25 16:26:16.497
  Aug  1 16:26:16.497: INFO: [CLEANUP] Waiting for load balancer destruction
  Aug  1 16:26:16.497: INFO: Waiting up to 15m0s for service "lbconfig-test-clb-sg" to have no LoadBalancer
  Aug  1 16:26:16.649: INFO: [CLEANUP] Load balancer destroyed successfully
  Aug  1 16:26:16.649: INFO: Cleaning up e2e resources
  Aug  1 16:26:16.649: INFO: Cleaning up security group sg-0c7a7bb83076403be
  Aug  1 16:26:17.683: INFO: Security group sg-0c7a7bb83076403be still has dependencies, waiting... (failed to delete security group sg-0c7a7bb83076403be: operation error EC2: DeleteSecurityGroup, https response error StatusCode: 400, RequestID: d048bc4c-9cb5-474d-b463-5635cd3c46f9, api error DependencyViolation: resource sg-0c7a7bb83076403be has a dependent object)

[...]

  Aug  1 16:26:36.764: INFO: Successfully deleted security group sg-0c7a7bb83076403be
  Aug  1 16:26:36.764: INFO: Successfully cleaned up security group sg-0c7a7bb83076403be
  STEP: Destroying namespace "cloud-provider-aws-5233" for this suite. @ 08/01/25 16:26:36.764
• [114.977 seconds]
------------------------------
SSSSSS

Ran 1 of 9 Specs in 114.977 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 8 Skipped
PASS

/test pull-cloud-provider-aws-e2e

mtulio avatar Aug 01 '25 19:08 mtulio

I also got green results from of all lb tests:

$ ./e2e.test --ginkgo.v  --ginkgo.focus="loadbalancer" 
...
Ran 5 of 9 Specs in 1291.068 seconds
SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 4 Skipped
PASS

mtulio avatar Aug 01 '25 20:08 mtulio

/test pull-cloud-provider-aws-e2e

mtulio avatar Aug 01 '25 20:08 mtulio

/test pull-cloud-provider-aws-e2e

mtulio avatar Aug 01 '25 20:08 mtulio

unrelated failure in the lb-internal tests which expects to fail (hairpinning). Checking if it was flake:

/test pull-cloud-provider-aws-e2e

mtulio avatar Aug 01 '25 22:08 mtulio

Review comments addressed, new e2e added and e2e passing on CI. This PR is ready for review. PTAL? Thanks

mtulio avatar Aug 02 '25 22:08 mtulio

e2e test [cloud-provider-aws-e2e] loadbalancer CLB with managed Security Group must update to BYO Security Group was added, here are the logs of the step which performs the Service update/checks: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1209/pull-cloud-provider-aws-e2e/1952377003214639104#1:build-log.txt%3A2140-2167

Investigating if the failure AWS Cloud Provider End-to-End Tests: [It] [cloud-provider-aws-e2e] loadbalancer CLB internal should be reachable with hairpinning traffic in is related to e2e updates.

converting to draft while increasing debug on internal test of CLB, looks like it's failing to retrieve pod information, checking if this is related to the service account. Once I get more information and isolate the issue I will return to ready.

mtulio avatar Aug 04 '25 20:08 mtulio

/test pull-cloud-provider-aws-e2e

mtulio avatar Aug 04 '25 20:08 mtulio

FAIL: expected managed security group "sg-0a917e4b14501635f" removed by controller, got "sg-0a917e4b14501635f"

Checking if I need to enhance the controller update the sg:

/test pull-cloud-provider-aws-e2e

mtulio avatar Aug 05 '25 04:08 mtulio

increase verbosity

/test pull-cloud-provider-aws-e2e

mtulio avatar Aug 05 '25 19:08 mtulio

e2e job green.

I am also leaving the e2e more verbose in case of test network failures, helping devs troubleshooting easier CI logs of internal connectivity / internal LB. LMK if that makes sense.

Converting to regular PR. PTAL? Thanks

mtulio avatar Aug 06 '25 01:08 mtulio

Converting to draft while I return on it to rebase and run deepen investigation on e2e failures.

mtulio avatar Aug 21 '25 14:08 mtulio

FWIW interim update, this PR is still alive and need to be fixed, and proposal could be used in the logic of BYOSG in NLBs. I am planning to return on it next week to rebase and ask for final review with recent updates in the Service NLB and e2e.

mtulio avatar Oct 02 '25 16:10 mtulio

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from elmiko. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Oct 21 '25 13:10 k8s-ci-robot

PR rebased including/merging managed NLB SG, ensuring readiness with new e2e tests validating the fix:

/test pull-cloud-provider-aws-e2e

mtulio avatar Oct 21 '25 13:10 mtulio