Introduce config opt-in NLB provisioning with Security Groups
What type of PR is this?
/kind feature
What this PR does / why we need it:
Introduce frontend Security Group for NLB on Service type-LoadBalancer creation by opt-in through cloud-config.
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer:
Done checklist:
- [x] introduce cloud-config to signalize the controller to always provision Security Group on Service type-LoadBalancer NLB
- [x] Create Security Group when NLB is created, and global config is added
- [x] Delete Security Group when NLB is removed
- [ ] Review SG rules logic
- [ ] Elaborate an e2e scenario that allows to update the cloud-config to exercise NLB+SG feature
- [ ] Resolve pending question related to the approach
Does this PR introduce a user-facing change?:
Introduce provisioning Service type-LoadBalancer NLB with Security Group as an opt-in configuration.
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @mtulio. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hey @kmala, I still have some items to review in this PR, but tests are passing locally. Would you mind stamping ok-to-test to validate if there is no local addiction from my env? Thanks
/ok-to-test
/test all
This PR has been reabased and some feedback addressed.
/test pull-cloud-provider-aws-test
I wanted to test e2e :) /test pull-cloud-provider-aws-e2e
This PR is failing in the e2e test config - where I expected due the open question https://github.com/kubernetes/cloud-provider-aws/pull/1158/files#r2155582853
Re-testing with skip when the test env does not have cloud-config. /test pull-cloud-provider-aws-e2e
PR https://github.com/kubernetes/cloud-provider-aws/pull/1159 merged, re-testing: /test all
Rebased and mostly successful tested main functionality locally. Also introduce devel and feature documentation.
The e2e is mostly skipped in other distributions than static[1] defined as I didn't find a good way to dynamically discover it yet.
/test all
[1] https://github.com/kubernetes/cloud-provider-aws/pull/1158/files#diff-05b1c14f2de829d8a0c5f65b1b492a9ed9ab9d100ce6daa89d2d2347c8a14c77R446-R449
Rebased and mostly successful tested main functionality locally. Also introduce devel and feature documentation.
The e2e is mostly skipped in other distributions than static[1] defined as I didn't find a good way to dynamically discover it yet.
/test all
[1] https://github.com/kubernetes/cloud-provider-aws/pull/1158/files#diff-05b1c14f2de829d8a0c5f65b1b492a9ed9ab9d100ce6daa89d2d2347c8a14c77R446-R449
Fixed units that was previously failing due v2 migration. Re-testing:
/test all
Tests are passing, converting to regular PR. PTAL? Thanks! /assign @elmiko @kmala @dims
I also investigating a bug in the logic of calculating ip permissions difference when a new port is added to existing service, looks like the migration to SDKv2 would be generating a wrong calculation with IPPermissionSet in setSecurityGroupIngress()
This is the result of my investigation so far:
- The actual code is comparing pointers when using
IPPermissionSet.Difference()
https://github.com/kubernetes/cloud-provider-aws/blob/af0e606f271a47809c50e0e84a7249f1d188f229/pkg/providers/v1/aws.go#L1329-L1330
Generating in wrong diff:
>>> permissions:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0
>>> actual:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0
>>> add:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0
>>> remove:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0
Making the request to fail as rule already exists:
I0709 22:38:23.258828 2259835 aws.go:1370] Adding security group ingress: sg-07a2c90ede4b1acac [{0xc000f99748 0xc000995ec0 [{0xc000a50160 <nil> {}}] [] [] 0xc000f9974c [] {}} {0xc000f99768 0xc000995ed0 [{0xc000a50160 <nil> {}}] [] [] 0xc000f9976c [] {}} {0xc000f99788 0xc000995ee0 [{0xc000a50160 <nil> {}}] [] [] 0xc000f9978c [] {}}]
E0709 22:38:23.510168 2259835 controller.go:301] "Unhandled Error" err="error processing service app-sample/app-sample-svc-ccm (retrying with exponential backoff): failed to ensure load balancer: error ensuring NLB security group rules: error while updating rules to security group \"sg-07a2c90ede4b1acac\": creating ingress rules for security group \"sg-07a2c90ede4b1acac\": error authorizing security group ingress: \"operation error EC2: AuthorizeSecurityGroupIngress, https response error StatusCode: 400, RequestID: 397171df-1aa6-40e5-9748-0ac6b2bd7629, api error InvalidPermission.Duplicate: the specified rule \\\"peer: 0.0.0.0/0, ICMP, type: 3, code: 4, ALLOW\\\" already exists\"" logger="UnhandledError"
I0709 22:38:23.510311 2259835 event.go:389] "Event occurred" object="app-sample/app-sample-svc-ccm" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: error ensuring NLB security group rules: error while updating rules to security group \"sg-07a2c90ede4b1acac\": creating ingress rules for security group \"sg-07a2c90ede4b1acac\": error authorizing security group ingress: \"operation error EC2: AuthorizeSecurityGroupIngress, https response error StatusCode: 400, RequestID: 397171df-1aa6-40e5-9748-0ac6b2bd7629, api error InvalidPermission.Duplicate: the specified rule \\\"peer: 0.0.0.0/0, ICMP, type: 3, code: 4, ALLOW\\\" already exists\""
This seems to be caused as in v2 the IpPermission aren't using the values anymore, leading to wrong comparations.
- Patching the code to iterating over sets, would resolve the problem:
>>> permissions:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0
>>> actual:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0
>>> add:
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0
>>> remove:
I0709 22:51:49.472567 2274177 aws.go:1402] Adding security group ingress: sg-07a2c90ede4b1acac [{0xc0007abd90 0xc000da70a0 [{0xc000e8a0a0 <nil> {}}] [] [] 0xc0007abd94 [] {}}]
I also tested the update remove port flow/rule with my patch and it is working as expected:
>>> permissions:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0
>>> actual:
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0
>>> add:
>>> remove:
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0
I0709 23:26:21.834824 2286979 aws.go:1409] Remove security group ingress: sg-00c0be91a0bcf46c6 [{0xc000387f78 0xc000f0f1c0 [{0xc000f0f1d0 <nil> {}}] [] [] 0xc000387fa8 [] {}}]
I0709 23:26:25.087161 2286979 aws.go:1530] Removing security group ingress: sg-0981d23751d8f01f1 [{0xc0004a9128 0xc001129f00 [{0xc001129f10 0xc001129f20 {}}] [] [] 0xc0004a9148 [] {}} {0xc0004a9128 0xc001129f00 [{0xc001129f30 0xc001129f40 {}}] [] [] 0xc0004a9148 [] {}}]
Version with BYO SG support with single SG.
/test all
/test pull-cloud-provider-aws-test
Testing after rebase pulling docs update.
/test pull-cloud-provider-aws-test
/test pull-cloud-provider-aws-e2e
PR rebased and removed BYO SG annotation to be handled in follow up PRs: /test all
I restored the BYO SG config that won't increase the complexity. I also added generic e2e covering BYO SG. The managed SG is a bit complex yet as I need to change the cloud-config and it's not clear to me how to do it in the existing framework.
/test all
fixed unit after renaming error out msg.
/test pull-cloud-provider-aws-test /test pull-cloud-provider-aws-e2e
increaseing verbosity in the e2e, I am also observing timeout waiting for LB in another jobs, wondering if this is something related to the CI account as I can run locally the e2e tests continuously without failures:
/test pull-cloud-provider-aws-e2e
$ $(which time) ./e2e.test --ginkgo.v --ginkgo.focus 'BYO Security Group'
W0722 23:27:59.897160 1570144 test_context.go:478] Unable to find in-cluster config, using default host : https://127.0.0.1:6443
Jul 22 23:27:59.897: INFO: The --provider flag is not set. Continuing as if --provider=skeleton had been used.
Running Suite: AWS Cloud Provider End-to-End Tests
=========================================================================================
Random Seed: 1753237679 - will randomize all specs
Will run 1 of 9 specs
------------------------------
[cloud-provider-aws-e2e] loadbalancer NLB with BYO Security Group annotation should use provided security group
[....]
Jul 22 23:31:31.931: INFO: Successfully cleaned up security group test-byo-sg-lbconfig-test-nlb-byo-sg (sg-0480ff71f9947625d)
STEP: Destroying namespace "cloud-provider-aws-39" for this suite. @ 07/22/25 23:31:31.931
• [212.195 seconds]
------------------------------
SSSSSSSS
Ran 1 of 9 Specs in 212.195 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 8 Skipped
PASS
0.13user 0.04system 3:32.22elapsed 0%CPU (0avgtext+0avgdata 51816maxresident)k
0inputs+0outputs (0major+5139minor)pagefaults 0swaps
local tests are passing, both unit and e2e:
$ $(which time) ./e2e.test --ginkgo.v --ginkgo.focus 'loadbalancer'
...
Ran 5 of 9 Specs in 1416.155 seconds
SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 4 Skipped
PASS
0.72user 0.26system 23:36.18elapsed 0%CPU (0avgtext+0avgdata 54008maxresident)k
0inputs+0outputs (0major+7375minor)pagefaults 0swaps
Checking CI:
/test pull-cloud-provider-aws-e2e
The current version has the BYO SG on NLB too, including the fixes of update from managed to BYO, issue https://github.com/kubernetes/cloud-provider-aws/issues/1208 and fix https://github.com/kubernetes/cloud-provider-aws/pull/1209 . I am considering this as a blocker. I am still checking if isolating BYO from this (managed) PR would provide a better review experience. cc @JoelSpeed @elmiko
/test pull-cloud-provider-aws-e2e
$ $(which time) ./e2e.test --ginkgo.v --ginkgo.focus 'NLB'
[...]
Ran 2 of 8 Specs in 978.200 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 6 Skipped
PASS
0.21user 0.07system 16:18.22elapsed 0%CPU (0avgtext+0avgdata 53004maxresident)k
32inputs+0outputs (1major+5489minor)pagefaults 0swaps
Looking green. /test all
Converting this PR to ready. PTAL, @elmiko @JoelSpeed ?
just starting to take a look at this.
Observing in the last run:
The maximum number of target groups has been reached
/test pull-cloud-provider-aws-e2e
I will rebase tough and try again. Quickly triage:
- pull-cloud-provider-aws-e2e: Got stuck downloading golang dependencies.
- pull-cloud-provider-aws-e2e-kubetest2: re-testing as the same test is being skipped in other runs.
/test pull-cloud-provider-aws-e2e-kubetest2 /test pull-cloud-provider-aws-e2e