cluster-api-provider-aws
Cannot remove subnets from NLB due to AWS restriction
/kind bug
What steps did you take and what happened: We are using NLB load balancers for our control plane endpoints, and we recently had a case where a customer scaled down their control plane from three nodes in three separate AZs to one node in one AZ. This caused an error in the CAPA bootstrap controller stating:
failed to reconcile load balancer" err=<
failed to set subnets for apiserver load balancer '***': ValidationError: Subnet removal is not supported for Network Load Balancers. You must specify all existing subnets along with any new ones
status code: 400, request id: ***
This concerns essentially the same code as issue #4357.
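For context, this restriction is enforced by AWS itself and can be reproduced with a plain ELBv2 call, independent of CAPA. Below is a minimal, untested sketch using aws-sdk-go v1; the load balancer ARN and subnet ID are placeholders, and it assumes an existing NLB that is currently attached to three subnets, with credentials and region taken from the environment:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := elbv2.New(sess)

	// The NLB is currently attached to three subnets; asking for only one of
	// them is interpreted as a subnet removal, which NLBs do not support.
	_, err := client.SetSubnets(&elbv2.SetSubnetsInput{
		LoadBalancerArn: aws.String("arn:aws:elasticloadbalancing:..."), // placeholder
		Subnets:         aws.StringSlice([]string{"subnet-0123456789abcdef0"}), // placeholder
	})
	if err != nil {
		// Expected: ValidationError: Subnet removal is not supported for
		// Network Load Balancers. You must specify all existing subnets
		// along with any new ones.
		log.Fatal(err)
	}
}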
What did you expect to happen:
The only solution, as far as I can tell, would be to delete and recreate the NLB if the number of subnets decreases.
Anything else you would like to add:
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/2d96288354dee44e60710048652267e2bf3e8c12/pkg/cloud/services/elb/loadbalancer.go#L113-L123
My suggested change would be:
// Reconcile the subnets and availability zones from the spec
// and the ones currently attached to the load balancer.
if len(lb.SubnetIDs) != len(spec.SubnetIDs) {
	if lb.LoadBalancerType == infrav1.LoadBalancerTypeNLB && len(lb.SubnetIDs) > len(spec.SubnetIDs) {
		// NLBs do not support subnet removal, so delete and recreate the
		// load balancer when the desired subnet set shrinks.
		err := s.deleteExistingNLBs()
		if err != nil {
			return errors.Wrapf(err, "failed to delete apiserver load balancer %q", lb.Name)
		}
		lb, err = s.createLB(spec)
		if err != nil {
			return errors.Wrapf(err, "failed to create apiserver load balancer %q", lb.Name)
		}
	} else {
		_, err := s.ELBV2Client.SetSubnets(&elbv2.SetSubnetsInput{
			LoadBalancerArn: &lb.ARN,
			Subnets:         aws.StringSlice(spec.SubnetIDs),
		})
		if err != nil {
			return errors.Wrapf(err, "failed to set subnets for apiserver load balancer '%s'", lb.Name)
		}
	}
}
Please keep in mind that this has not been tested yet.
Environment:
- Cluster-api-provider-aws version: > v2.1.0
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:40:17Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
  Kustomize Version: v4.5.7
  Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.10-dhc", GitCommit:"b8609d4dd75c5d6fba4a5eaa63a5507cb39a6e99", GitTreeState:"dirty", BuildDate:"2023-11-16T17:01:19Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS
Philipp Schöppner <[email protected]>, Mercedes-Benz Tech Innovation GmbH (Provider Information)
This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and providing further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
NLBs are definitely difficult to modify, so deleting and recreating the LB makes sense. I'm concerned that this will be disruptive to the cluster overall, since the LB would disappear during normal operation.
Can you talk more about situations when you'd remove existing subnets from the LB?
> scaled down their control plane from three nodes in three separate AZs to one node in one AZ.

I had skipped past this while reading; given this scenario, I think it's reasonable to expect an interruption to the LB's availability. Does that make sense to you, @schoeppi5?
Hey @nrb, thanks for your question. Yes, unavailability in this scenario is expected and accepted behaviour, so we are fine with it. I also don't think there is a lot we could do about this issue, since it is a limitation of the NLB itself.
Cool, that makes sense. And you're right, we can't do a whole lot about it since it's how NLBs work.
I do think there's a concern that someone removes a subnet and _isn't_ doing such a downscaling operation. I'd like to give this a little more thought in terms of general risk. I'll also add it to the community meeting notes to highlight it for maintainers.
Hey @nrb, I wasn't able to join the community meeting on April 8th, but I can see that you added this issue to the meeting notes. Did you get a chance to discuss this and - if so - what was the outcome?
Were you able to give this topic some more thought? Is there anything I can do to assist?
Apologies - the community meeting did not happen on April 8 - we decided to resume the normal schedule on April 15.
I haven't given it a lot more thought, but I think this is fairly low risk given the scenarios where users would change things. I'll still bring it up and double check that others don't have objections.
From the community meeting:
- Would we be able to guarantee that the NLB will get the same IP? (see the sketch below)
- We may also want to make this an opt-in feature, since it is destructive and could surprise users.
- Overall, no objections, though.
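On the question of keeping the same IP: for an internet-facing NLB, Elastic IPs can be attached via subnet mappings at creation time, which would let a recreated load balancer come back with the same public addresses. The following is a minimal, hypothetical sketch of that idea using aws-sdk-go v1, not CAPA's current code; the name, subnet ID, and EIP allocation ID are placeholders:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := elbv2.New(sess)

	// Recreate the NLB with an explicit subnet mapping. For an internet-facing
	// NLB, AllocationId pins a pre-allocated Elastic IP to the load balancer
	// node in that subnet, so the public IP survives recreation.
	out, err := client.CreateLoadBalancer(&elbv2.CreateLoadBalancerInput{
		Name:   aws.String("example-apiserver-nlb"), // placeholder
		Type:   aws.String(elbv2.LoadBalancerTypeEnumNetwork),
		Scheme: aws.String(elbv2.LoadBalancerSchemeEnumInternetFacing),
		SubnetMappings: []*elbv2.SubnetMapping{
			{
				SubnetId:     aws.String("subnet-0123456789abcdef0"),   // placeholder
				AllocationId: aws.String("eipalloc-0123456789abcdef0"), // placeholder
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("recreated NLB: %s", aws.StringValue(out.LoadBalancers[0].LoadBalancerArn))
}

An internal NLB would instead need PrivateIPv4Address in the subnet mapping to pin its address; either way, whether CAPA should manage those addresses at all is part of the opt-in discussion above.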
Any update? I'm seeing messages like these in my event logs.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten