Controller leaks Load Balancer when Service annotation `service.beta.kubernetes.io/aws-load-balancer-type` is added/removed
What happened:
The controller leaks load balancers when the annotation `service.beta.kubernetes.io/aws-load-balancer-type` is updated (added or removed). This is not a common operation, but it is a valid user flow that can lead to unexpectedly leaked resources.
Scenarios:
- Given a Service created without the annotation `service.beta.kubernetes.io/aws-load-balancer-type`, when it is updated with the annotation `service.beta.kubernetes.io/aws-load-balancer-type=nlb`, an NLB is created and the CLB is leaked (along with its security groups).
- Given a Service created with the annotation `service.beta.kubernetes.io/aws-load-balancer-type=nlb`, when the annotation is removed, the controller creates a CLB and the NLB is leaked.
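To make the flow concrete, here is a minimal Service manifest for the first scenario (the name, selector, and ports are placeholders, not taken from this report): the Service is first created without the annotation, which provisions a CLB, and the annotation shown below is then added, which is what triggers the leak.

```yaml
# Hypothetical Service used only to illustrate the first scenario above.
apiVersion: v1
kind: Service
metadata:
  name: echo
  namespace: default
  annotations:
    # Initially the Service is created WITHOUT this annotation, so the cloud
    # provider provisions a Classic Load Balancer (CLB). Adding the annotation
    # afterwards makes the controller provision an NLB, while the original CLB
    # and its security groups are left behind.
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 8080
```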
What you expected to happen:
Options to discuss with the community:
- The controller should not create a new load balancer when the annotation `service.beta.kubernetes.io/aws-load-balancer-type` is added to or removed from an existing Service that already has a load balancer.
- The controller should "recreate" the load balancer (e.g. delete the CLB and create an NLB) when the annotation `service.beta.kubernetes.io/aws-load-balancer-type` is added or removed.
How to reproduce it (as minimally and precisely as possible):
CLB leak:
- Create a Service without the annotation `service.beta.kubernetes.io/aws-load-balancer-type`
- Add the annotation `service.beta.kubernetes.io/aws-load-balancer-type=nlb`
- Check in the AWS console that the NLB has been created
- Check in the AWS console that the CLB has been leaked
Anything else we need to know?:
What is the expected behavior in this case, considering there is no validation webhook available in the CCM?
Environment:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Install tools:
- Others:
/kind bug
cc @elmiko @JoelSpeed @nrb @kmala
Marco and I stumbled across this while testing PR #1158.
Upstream documentation already notes that this value should not be changed (see here)
ALBO documentation explains this as well (see here)
But there's no enforcement of this since there is nothing to stop someone changing this value at runtime.
You could use a ValidatingAdmissionPolicy to prevent this, but I don't think the CCMs have any way to guarantee this presently
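For reference, a rough sketch of what such a ValidatingAdmissionPolicy could look like; the policy and binding names and the CEL expression are illustrative assumptions (nothing like this ships with the CCM today), and a cluster admin would have to install it separately:

```yaml
# Hypothetical policy that rejects any change to the aws-load-balancer-type
# annotation on Service updates. Names are made up for illustration.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-aws-lb-type-change
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["services"]
  validations:
    - expression: >-
        (has(object.metadata.annotations) && 'service.beta.kubernetes.io/aws-load-balancer-type' in object.metadata.annotations
        ? object.metadata.annotations['service.beta.kubernetes.io/aws-load-balancer-type'] : '')
        == (has(oldObject.metadata.annotations) && 'service.beta.kubernetes.io/aws-load-balancer-type' in oldObject.metadata.annotations
        ? oldObject.metadata.annotations['service.beta.kubernetes.io/aws-load-balancer-type'] : '')
      message: "service.beta.kubernetes.io/aws-load-balancer-type cannot be added, removed, or changed after creation"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: deny-aws-lb-type-change-binding
spec:
  policyName: deny-aws-lb-type-change
  validationActions: ["Deny"]
```

This sketch blocks any add/remove/change of the annotation on UPDATE; a cluster could relax it to only reject specific transitions.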
In theory, we could implement a status through annotations on the service object, but this has historically been avoided, in favour of being able to use predictable names to find existing resources. Perhaps the logic can be updated to check for orphaned resources that appear to be from the service?
> You could use a ValidatingAdmissionPolicy to prevent this, but I don't think the CCMs have any way to guarantee this presently
That would be a really good idea if we have a way to do it, especially if there is something in the core (cloud-provider) so that each provider's implementation can write its expressions accordingly.
> Perhaps the logic can be updated to check for orphaned resources that appear to be from the service?
This would be an option, but it would increase the controller's API calls when synchronizing LBs. Considering there is no active reconciliation loop for Services, it may not increase much per Service (roughly two extra calls per add/update?: describe the CLBs and NLBs by name and validate the type against the annotation?).
> but it would increase the controller's API calls when synchronizing LBs
Yes, I suspect this is not the way we would want to go because of those additional API calls that would be introduced
@mtulio I see we had a PR referencing this issue that has merged; do we have further work to do here?
@elmiko, the linked PR covers the feature where we found this issue, not fixes for the issue itself.
I have not had much time to validate ideas other than those exposed above, especially considering the one viable in the short term may not be the way we want to go.
ack, thanks for the update!