cloud-provider-aws icon indicating copy to clipboard operation
cloud-provider-aws copied to clipboard

Controller leaks Load Balancer when Service annotation `service.beta.kubernetes.io/aws-load-balancer-type` is added/removed

Open mtulio opened this issue 3 months ago • 10 comments

What happened:

The controller is leaking load balancers when the annotation service.beta.kubernetes.io/aws-load-balancer-type is updated (added or removed). This is not something usual, but it's a valid user flow that can lead to unexpected leaked resources.

Scenarios:

    1. Given a Service created without annotation service.beta.kubernetes.io/aws-load-balancer-type, then updated to the annotation service.beta.kubernetes.io/aws-load-balancer-type=nlb, a NLB will be created, and CLB will be leaked (and it's SGs)
    1. Given a Service created with annotation service.beta.kubernetes.io/aws-load-balancer-type=nlb, then the annotation is removed, the controller will create a CLB and leak the NLB

What you expected to happen:

Options to discuss with community:

  • Controller should not create a new load balancer when the annotation service.beta.kubernetes.io/aws-load-balancer-type is added or removed from existing services which already have load balancers
  • Controller should "recreate" (e.g. delete CLB and add NLB) when the annotation is service.beta.kubernetes.io/aws-load-balancer-type

How to reproduce it (as minimally and precisely as possible):

CLB leak:

  1. Create a Service (without annotation service.beta.kubernetes.io/aws-load-balancer-type
  2. Add the annotation service.beta.kubernetes.io/aws-load-balancer-type=nlb
  3. Check in the AWS console that the NLB has been created
  4. Check in the AWS Console that the CLB has been leaked

Anything else we need to know?:

What is the expected option/behavior in this case considering there is no validation webhook available in CCM?

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

mtulio avatar Sep 04 '25 18:09 mtulio

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Sep 04 '25 18:09 k8s-ci-robot

cc @elmiko @JoelSpeed @nrb @kmala

mtulio avatar Sep 04 '25 18:09 mtulio

Marco and i stumbled across this while testing the #1158 PR

elmiko avatar Sep 04 '25 18:09 elmiko

Upstream documentation already notes that this value should not be changed (see here)

ALBO documentation explains this as well (see here)

But there's no enforcement of this since there is nothing to stop someone changing this value at runtime.

You could use a ValidatingAdmissionPolicy to prevent this, but I don't think the CCMs have any way to guarantee this presently

In theory, we could implement a status through annotations on the service object, but this has historically been avoided, in favour of being able to use predictable names to find existing resources. Perhaps the logic can be updated to check for orphaned resources that appear to be from the service?

JoelSpeed avatar Sep 08 '25 11:09 JoelSpeed

You could use a ValidatingAdmissionPolicy to prevent this, but I don't think the CCMs have any way to guarantee this presently

It's really a good idea if we have a way to do it, specially if there are something in the core (cloud-provider) then provider's specific implementations write their expressions accordingly.

Perhaps the logic can be updated to check for orphaned resources that appear to be from the service?

This would be an option, but would increase controller's API calls when syncronizing LBs, considering there is no active re-conciliator loop to the Services, it may not increase much per service (kicking vaguely two maybe per add/update?: describe CLBs and NLBs from Names and validate type against the annotation?)

mtulio avatar Sep 08 '25 18:09 mtulio

but would increase controller's API calls when syncronizing LBs

Yes, I suspect this is not the way we would want to go because of those additional API calls that would be introduced

JoelSpeed avatar Sep 10 '25 17:09 JoelSpeed

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 09 '25 18:12 k8s-triage-robot

/remove-lifecycle stale

@mtulio i see we had a PR reference this issue that has merged, do we have further work to do here?

elmiko avatar Dec 10 '25 15:12 elmiko

@elmiko , the linked PR covers a feature we we found this issue, and not the issue fixes.

I did not have much time to validate other ideas than exposed above, specially considering the viable one for short term may not the way we want to go.

mtulio avatar Dec 10 '25 16:12 mtulio

ack, thanks for the update!

elmiko avatar Dec 10 '25 17:12 elmiko