Fix leak managed/owned security group on Service update with BYO SG
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a leaked security group (SG) when a Service type-loadBalancer (CLB) is updated adding the BYO SG annotation (service.beta.kubernetes.io/aws-load-balancer-security-groups), which replaces all SG added to the Load Balancer without removing rules and deleting it when created by controller.
Which issue(s) this PR fixes:
Fixes #1208
Special notes for your reviewer:
The approach of creating isolated function was used specially to:
- enhance code maintenance
- enhance unit tests
- allow to reuse the logic when NLB with SG is supported
The unit tests and documentation(function) comments have been assisted by Cursor AI(model claude-4-sonet): AIA HAb SeCeNc Hin R v1.0
Does this PR introduce a user-facing change?:
Fixed security group leak when updating Classic Load Balancer services with `service.beta.kubernetes.io/aws-load-balancer-security-groups` annotation. Controller-managed security groups are now properly cleaned up when switching to user-specified security groups.
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @mtulio. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
/test all
Fixing doc strings and failed unit tests from previous unexpected behavior:
/test all
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-cloud-provider-aws-e2e-kubetest2
/test all
I can't find connection between failures in pull-cloud-provider-aws-e2e-kubetest2 and existing changes.
I am going to convert to regular PR to ask for reviewers while we observe if this isnt a CI flake.
PTAL? /assign @kmala @elmiko @JoelSpeed
This PR is ready for review. PTAL?
/test pull-cloud-provider-aws-e2e-kubetest2-quick
Experiencing some instability on CI, failed to build image. Retrying: /test pull-cloud-provider-aws-e2e
/test pull-cloud-provider-aws-e2e
Rebase and testing e2e that got stuck on downloading dependencies. Waiting for new results.
I am going to write a new e2e test case covering the update of from managed to BYO SG on CLB to increase confidence to this PR.
Introduced a new e2e test case to validate the full workflow of proposed fix.
$ ./e2e.test --ginkgo.v --ginkgo.focus="CLB with managed Security Group"
W0801 16:24:41.940826 487078 test_context.go:478] Unable to find in-cluster config, using default host : https://127.0.0.1:6443
Aug 1 16:24:41.940: INFO: The --provider flag is not set. Continuing as if --provider=skeleton had been used.
Running Suite: AWS Cloud Provider End-to-End Tests - /home/redacted
=========================================================================================
Random Seed: 1754076281 - will randomize all specs
Will run 1 of 9 specs
SS
------------------------------
[cloud-provider-aws-e2e] loadbalancer CLB with managed Security Group must update to BYO Security Group
[...]
STEP: executing pre-test hook @ 08/01/25 16:24:56.104
Aug 1 16:24:56.104: INFO: running hook post-service-config patching service annotation with BYO security group
Aug 1 16:24:56.104: INFO: Getting load balancer security groups with address a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com
Aug 1 16:25:03.887: INFO: Load balancer a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com has security groups: [sg-037bc641dfde2e5b6]
Aug 1 16:25:03.887: INFO: Checking if managed SGs are owned by the controller
Aug 1 16:25:04.147: INFO: Security group "sg-037bc641dfde2e5b6" is managed: true
Aug 1 16:25:04.147: INFO: Creating BYO SG with name "cloud-provider-aws-5233-lbconfig-test-clb-sg-sg-byo"
Aug 1 16:25:04.570: INFO: BYO SG "cloud-provider-aws-5233-lbconfig-test-clb-sg-sg-byo" created with ID "sg-0c7a7bb83076403be"
Aug 1 16:25:04.570: INFO: Authorizing BYO SG "sg-0c7a7bb83076403be" to service ports: [{Name:port-0 Protocol:TCP AppProtocol:<nil> Port:80 TargetPort:{Type:0 IntVal:8080 StrVal:} NodePort:32297}]
Aug 1 16:25:04.570: INFO: Authorizing security group sg-0c7a7bb83076403be for 1 ports
Aug 1 16:25:04.570: INFO: Adding rule: protocol=tcp, port=80, source=0.0.0.0/0
Aug 1 16:25:04.570: INFO: Calling AWS AuthorizeSecurityGroupIngress for security group sg-0c7a7bb83076403be with 1 rules
Aug 1 16:25:04.893: INFO: Successfully authorized security group sg-0c7a7bb83076403be.
Aug 1 16:25:04.893: INFO: BYO SG "sg-0c7a7bb83076403be" authorized to service ports
Aug 1 16:25:04.893: INFO: Verifying security group rules were created for BYO SG "sg-0c7a7bb83076403be"
Aug 1 16:25:05.141: INFO: Security group sg-0c7a7bb83076403be has 1 ingress rules:
Aug 1 16:25:05.141: INFO: Rule 1: protocol=tcp, ports=80-80, sources=[0.0.0.0/0]
Aug 1 16:25:05.141: INFO: ✓ Found rule for protocol=tcp port=80 in security group sg-0c7a7bb83076403be
Aug 1 16:25:05.141: INFO: Patching Service "lbconfig-test-clb-sg" with BYO SG "sg-0c7a7bb83076403be"
Aug 1 16:25:05.298: INFO: Waiting for load balancer to be updated
Aug 1 16:25:15.307: INFO: Getting load balancer security groups after update
Aug 1 16:25:15.533: INFO: Load balancer a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com has security groups: [sg-0c7a7bb83076403be]
Aug 1 16:25:15.533: INFO: Checking if LB is using BYO SG "sg-0c7a7bb83076403be"
Aug 1 16:25:15.533: INFO: Load balancer a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com has BYO security group "sg-0c7a7bb83076403be"
Aug 1 16:25:15.533: INFO: Checking if managed SGs were removed
Aug 1 16:25:15.533: INFO: Checking if managed SG "sg-037bc641dfde2e5b6" is removed by controller
Aug 1 16:25:15.775: INFO: Managed security group "sg-037bc641dfde2e5b6" removed
Aug 1 16:25:15.775: INFO: pre-test hook completed
[...]
STEP: testing HTTP connectivity from external client @ 08/01/25 16:25:15.775
Aug 1 16:25:15.775: INFO: [TEST] Running external connectivity test to a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com:80
Aug 1 16:25:15.775: INFO: Poking "http://a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com:80/echo?msg=hello"
Aug 1 16:25:15.840: INFO: Poke("http://a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com:80/echo?msg=hello"): Get "http://a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com:80/echo?msg=hello": dial tcp: lookup a8874a05a484e4f578e552b4a6cd9724-712231364.us-east-1.elb.amazonaws.com: no such host
[...]
Aug 1 16:26:16.183: INFO: [TEST] HTTP connectivity test completed successfully
STEP: cleaning up: converting service to ClusterIP @ 08/01/25 16:26:16.183
STEP: cleaning up: waiting for load balancer destruction @ 08/01/25 16:26:16.497
Aug 1 16:26:16.497: INFO: [CLEANUP] Waiting for load balancer destruction
Aug 1 16:26:16.497: INFO: Waiting up to 15m0s for service "lbconfig-test-clb-sg" to have no LoadBalancer
Aug 1 16:26:16.649: INFO: [CLEANUP] Load balancer destroyed successfully
Aug 1 16:26:16.649: INFO: Cleaning up e2e resources
Aug 1 16:26:16.649: INFO: Cleaning up security group sg-0c7a7bb83076403be
Aug 1 16:26:17.683: INFO: Security group sg-0c7a7bb83076403be still has dependencies, waiting... (failed to delete security group sg-0c7a7bb83076403be: operation error EC2: DeleteSecurityGroup, https response error StatusCode: 400, RequestID: d048bc4c-9cb5-474d-b463-5635cd3c46f9, api error DependencyViolation: resource sg-0c7a7bb83076403be has a dependent object)
[...]
Aug 1 16:26:36.764: INFO: Successfully deleted security group sg-0c7a7bb83076403be
Aug 1 16:26:36.764: INFO: Successfully cleaned up security group sg-0c7a7bb83076403be
STEP: Destroying namespace "cloud-provider-aws-5233" for this suite. @ 08/01/25 16:26:36.764
• [114.977 seconds]
------------------------------
SSSSSS
Ran 1 of 9 Specs in 114.977 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 8 Skipped
PASS
/test pull-cloud-provider-aws-e2e
I also got green results from of all lb tests:
$ ./e2e.test --ginkgo.v --ginkgo.focus="loadbalancer"
...
Ran 5 of 9 Specs in 1291.068 seconds
SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 4 Skipped
PASS
/test pull-cloud-provider-aws-e2e
/test pull-cloud-provider-aws-e2e
unrelated failure in the lb-internal tests which expects to fail (hairpinning). Checking if it was flake:
/test pull-cloud-provider-aws-e2e
Review comments addressed, new e2e added and e2e passing on CI. This PR is ready for review. PTAL? Thanks
e2e test [cloud-provider-aws-e2e] loadbalancer CLB with managed Security Group must update to BYO Security Group was added, here are the logs of the step which performs the Service update/checks: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1209/pull-cloud-provider-aws-e2e/1952377003214639104#1:build-log.txt%3A2140-2167
Investigating if the failure AWS Cloud Provider End-to-End Tests: [It] [cloud-provider-aws-e2e] loadbalancer CLB internal should be reachable with hairpinning traffic in is related to e2e updates.
converting to draft while increasing debug on internal test of CLB, looks like it's failing to retrieve pod information, checking if this is related to the service account. Once I get more information and isolate the issue I will return to ready.
/test pull-cloud-provider-aws-e2e
FAIL: expected managed security group "sg-0a917e4b14501635f" removed by controller, got "sg-0a917e4b14501635f"
Checking if I need to enhance the controller update the sg:
/test pull-cloud-provider-aws-e2e
increase verbosity
/test pull-cloud-provider-aws-e2e
e2e job green.
I am also leaving the e2e more verbose in case of test network failures, helping devs troubleshooting easier CI logs of internal connectivity / internal LB. LMK if that makes sense.
Converting to regular PR. PTAL? Thanks
Converting to draft while I return on it to rebase and run deepen investigation on e2e failures.
FWIW interim update, this PR is still alive and need to be fixed, and proposal could be used in the logic of BYOSG in NLBs. I am planning to return on it next week to rebase and ask for final review with recent updates in the Service NLB and e2e.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from elmiko. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
PR rebased including/merging managed NLB SG, ensuring readiness with new e2e tests validating the fix:
/test pull-cloud-provider-aws-e2e