cloud-provider-aws icon indicating copy to clipboard operation
cloud-provider-aws copied to clipboard

Introduce config opt-in NLB provisioning with Security Groups

Open mtulio opened this issue 6 months ago • 12 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

Introduce frontend Security Group for NLB on Service type-LoadBalancer creation by opt-in through cloud-config.

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:

Done checklist:

  • [x] introduce cloud-config to signalize the controller to always provision Security Group on Service type-LoadBalancer NLB
  • [x] Create Security Group when NLB is created, and global config is added
  • [x] Delete Security Group when NLB is removed
  • [ ] Review SG rules logic
  • [ ] Elaborate an e2e scenario that allows to update the cloud-config to exercise NLB+SG feature
  • [ ] Resolve pending question related to the approach

Does this PR introduce a user-facing change?:

Introduce provisioning Service type-LoadBalancer NLB with Security Group as an opt-in configuration.

mtulio avatar Jun 11 '25 21:06 mtulio

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 11 '25 21:06 k8s-ci-robot

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 11 '25 21:06 k8s-ci-robot

Hey @kmala, I still have some items to review in this PR, but tests are passing locally. Would you mind stamping ok-to-test to validate if there is no local addiction from my env? Thanks

mtulio avatar Jun 11 '25 22:06 mtulio

/ok-to-test

kmala avatar Jun 11 '25 23:06 kmala

/test all

mtulio avatar Jun 13 '25 03:06 mtulio

This PR has been reabased and some feedback addressed.

/test pull-cloud-provider-aws-test

mtulio avatar Jun 18 '25 23:06 mtulio

I wanted to test e2e :) /test pull-cloud-provider-aws-e2e

mtulio avatar Jun 18 '25 23:06 mtulio

This PR is failing in the e2e test config - where I expected due the open question https://github.com/kubernetes/cloud-provider-aws/pull/1158/files#r2155582853

mtulio avatar Jun 20 '25 18:06 mtulio

Re-testing with skip when the test env does not have cloud-config. /test pull-cloud-provider-aws-e2e

mtulio avatar Jun 21 '25 00:06 mtulio

PR https://github.com/kubernetes/cloud-provider-aws/pull/1159 merged, re-testing: /test all

mtulio avatar Jun 28 '25 04:06 mtulio

Rebased and mostly successful tested main functionality locally. Also introduce devel and feature documentation.

The e2e is mostly skipped in other distributions than static[1] defined as I didn't find a good way to dynamically discover it yet.

/test all

[1] https://github.com/kubernetes/cloud-provider-aws/pull/1158/files#diff-05b1c14f2de829d8a0c5f65b1b492a9ed9ab9d100ce6daa89d2d2347c8a14c77R446-R449

mtulio avatar Jul 08 '25 22:07 mtulio

Rebased and mostly successful tested main functionality locally. Also introduce devel and feature documentation.

The e2e is mostly skipped in other distributions than static[1] defined as I didn't find a good way to dynamically discover it yet.

/test all

[1] https://github.com/kubernetes/cloud-provider-aws/pull/1158/files#diff-05b1c14f2de829d8a0c5f65b1b492a9ed9ab9d100ce6daa89d2d2347c8a14c77R446-R449

Fixed units that was previously failing due v2 migration. Re-testing:

/test all

mtulio avatar Jul 08 '25 22:07 mtulio

Tests are passing, converting to regular PR. PTAL? Thanks! /assign @elmiko @kmala @dims

mtulio avatar Jul 09 '25 02:07 mtulio

I also investigating a bug in the logic of calculating ip permissions difference when a new port is added to existing service, looks like the migration to SDKv2 would be generating a wrong calculation with IPPermissionSet in setSecurityGroupIngress()

This is the result of my investigation so far:

  • The actual code is comparing pointers when using IPPermissionSet.Difference()

https://github.com/kubernetes/cloud-provider-aws/blob/af0e606f271a47809c50e0e84a7249f1d188f229/pkg/providers/v1/aws.go#L1329-L1330

Generating in wrong diff:

>>> permissions:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0 
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0 
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0 
>>> actual: 
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0 
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0 
>>> add:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0 
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0 
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0 
>>> remove:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0 
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0 

Making the request to fail as rule already exists:

I0709 22:38:23.258828 2259835 aws.go:1370] Adding security group ingress: sg-07a2c90ede4b1acac [{0xc000f99748 0xc000995ec0 [{0xc000a50160 <nil> {}}] [] [] 0xc000f9974c [] {}} {0xc000f99768 0xc000995ed0 [{0xc000a50160 <nil> {}}] [] [] 0xc000f9976c [] {}} {0xc000f99788 0xc000995ee0 [{0xc000a50160 <nil> {}}] [] [] 0xc000f9978c [] {}}]
E0709 22:38:23.510168 2259835 controller.go:301] "Unhandled Error" err="error processing service app-sample/app-sample-svc-ccm (retrying with exponential backoff): failed to ensure load balancer: error ensuring NLB security group rules: error while updating rules to security group \"sg-07a2c90ede4b1acac\": creating ingress rules for security group \"sg-07a2c90ede4b1acac\": error authorizing security group ingress: \"operation error EC2: AuthorizeSecurityGroupIngress, https response error StatusCode: 400, RequestID: 397171df-1aa6-40e5-9748-0ac6b2bd7629, api error InvalidPermission.Duplicate: the specified rule \\\"peer: 0.0.0.0/0, ICMP, type: 3, code: 4, ALLOW\\\" already exists\"" logger="UnhandledError"
I0709 22:38:23.510311 2259835 event.go:389] "Event occurred" object="app-sample/app-sample-svc-ccm" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: error ensuring NLB security group rules: error while updating rules to security group \"sg-07a2c90ede4b1acac\": creating ingress rules for security group \"sg-07a2c90ede4b1acac\": error authorizing security group ingress: \"operation error EC2: AuthorizeSecurityGroupIngress, https response error StatusCode: 400, RequestID: 397171df-1aa6-40e5-9748-0ac6b2bd7629, api error InvalidPermission.Duplicate: the specified rule \\\"peer: 0.0.0.0/0, ICMP, type: 3, code: 4, ALLOW\\\" already exists\""

This seems to be caused as in v2 the IpPermission aren't using the values anymore, leading to wrong comparations.

  • Patching the code to iterating over sets, would resolve the problem:
>>> permissions:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0 
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0 
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0 
>>> actual: 
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0 
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0 
>>> add:
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0 
>>> remove:
I0709 22:51:49.472567 2274177 aws.go:1402] Adding security group ingress: sg-07a2c90ede4b1acac [{0xc0007abd90 0xc000da70a0 [{0xc000e8a0a0 <nil> {}}] [] [] 0xc0007abd94 [] {}}]

mtulio avatar Jul 10 '25 02:07 mtulio

I also tested the update remove port flow/rule with my patch and it is working as expected:

>>> permissions:
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0 
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0 
>>> actual: 
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0 
ipProtocol: tcp, fromPort: 80, toPort: 80 ipRange: 0.0.0.0/0 
ipProtocol: icmp, fromPort: 3, toPort: 4 ipRange: 0.0.0.0/0 
>>> add:
>>> remove:
ipProtocol: tcp, fromPort: 81, toPort: 81 ipRange: 0.0.0.0/0 
I0709 23:26:21.834824 2286979 aws.go:1409] Remove security group ingress: sg-00c0be91a0bcf46c6 [{0xc000387f78 0xc000f0f1c0 [{0xc000f0f1d0 <nil> {}}] [] [] 0xc000387fa8 [] {}}]
I0709 23:26:25.087161 2286979 aws.go:1530] Removing security group ingress: sg-0981d23751d8f01f1 [{0xc0004a9128 0xc001129f00 [{0xc001129f10 0xc001129f20 {}}] [] [] 0xc0004a9148 [] {}} {0xc0004a9128 0xc001129f00 [{0xc001129f30 0xc001129f40 {}}] [] [] 0xc0004a9148 [] {}}]

mtulio avatar Jul 10 '25 02:07 mtulio

Version with BYO SG support with single SG.

/test all

mtulio avatar Jul 10 '25 03:07 mtulio

/test pull-cloud-provider-aws-test

mtulio avatar Jul 10 '25 18:07 mtulio

Testing after rebase pulling docs update.

/test pull-cloud-provider-aws-test

mtulio avatar Jul 10 '25 20:07 mtulio

/test pull-cloud-provider-aws-e2e

mtulio avatar Jul 10 '25 21:07 mtulio

PR rebased and removed BYO SG annotation to be handled in follow up PRs: /test all

mtulio avatar Jul 22 '25 16:07 mtulio

I restored the BYO SG config that won't increase the complexity. I also added generic e2e covering BYO SG. The managed SG is a bit complex yet as I need to change the cloud-config and it's not clear to me how to do it in the existing framework.

/test all

mtulio avatar Jul 22 '25 22:07 mtulio

fixed unit after renaming error out msg.

/test pull-cloud-provider-aws-test /test pull-cloud-provider-aws-e2e

mtulio avatar Jul 22 '25 22:07 mtulio

increaseing verbosity in the e2e, I am also observing timeout waiting for LB in another jobs, wondering if this is something related to the CI account as I can run locally the e2e tests continuously without failures:

/test pull-cloud-provider-aws-e2e

$ $(which time) ./e2e.test --ginkgo.v  --ginkgo.focus 'BYO Security Group'
  W0722 23:27:59.897160 1570144 test_context.go:478] Unable to find in-cluster config, using default host : https://127.0.0.1:6443
  Jul 22 23:27:59.897: INFO: The --provider flag is not set. Continuing as if --provider=skeleton had been used.
Running Suite: AWS Cloud Provider End-to-End Tests 
=========================================================================================
Random Seed: 1753237679 - will randomize all specs

Will run 1 of 9 specs
------------------------------
[cloud-provider-aws-e2e] loadbalancer NLB with BYO Security Group annotation should use provided security group
[....]
  Jul 22 23:31:31.931: INFO: Successfully cleaned up security group test-byo-sg-lbconfig-test-nlb-byo-sg (sg-0480ff71f9947625d)
  STEP: Destroying namespace "cloud-provider-aws-39" for this suite. @ 07/22/25 23:31:31.931
• [212.195 seconds]
------------------------------
SSSSSSSS

Ran 1 of 9 Specs in 212.195 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 8 Skipped
PASS
0.13user 0.04system 3:32.22elapsed 0%CPU (0avgtext+0avgdata 51816maxresident)k
0inputs+0outputs (0major+5139minor)pagefaults 0swaps

mtulio avatar Jul 23 '25 02:07 mtulio

local tests are passing, both unit and e2e:

$ $(which time) ./e2e.test --ginkgo.v  --ginkgo.focus 'loadbalancer'
...
Ran 5 of 9 Specs in 1416.155 seconds
SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 4 Skipped
PASS
0.72user 0.26system 23:36.18elapsed 0%CPU (0avgtext+0avgdata 54008maxresident)k
0inputs+0outputs (0major+7375minor)pagefaults 0swaps

Checking CI:

/test pull-cloud-provider-aws-e2e

mtulio avatar Jul 23 '25 21:07 mtulio

The current version has the BYO SG on NLB too, including the fixes of update from managed to BYO, issue https://github.com/kubernetes/cloud-provider-aws/issues/1208 and fix https://github.com/kubernetes/cloud-provider-aws/pull/1209 . I am considering this as a blocker. I am still checking if isolating BYO from this (managed) PR would provide a better review experience. cc @JoelSpeed @elmiko

/test pull-cloud-provider-aws-e2e


$ $(which time) ./e2e.test --ginkgo.v  --ginkgo.focus 'NLB'
[...]
Ran 2 of 8 Specs in 978.200 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 6 Skipped
PASS
0.21user 0.07system 16:18.22elapsed 0%CPU (0avgtext+0avgdata 53004maxresident)k
32inputs+0outputs (1major+5489minor)pagefaults 0swaps

mtulio avatar Jul 24 '25 15:07 mtulio

Looking green. /test all

mtulio avatar Jul 24 '25 17:07 mtulio

Converting this PR to ready. PTAL, @elmiko @JoelSpeed ?

mtulio avatar Jul 24 '25 18:07 mtulio

just starting to take a look at this.

elmiko avatar Jul 24 '25 20:07 elmiko

Observing in the last run:

The maximum number of target groups has been reached

/test pull-cloud-provider-aws-e2e

mtulio avatar Jul 29 '25 14:07 mtulio

I will rebase tough and try again. Quickly triage:

/test pull-cloud-provider-aws-e2e-kubetest2 /test pull-cloud-provider-aws-e2e

mtulio avatar Jul 29 '25 20:07 mtulio