cluster-api-provider-aws
Cannot remove subnets from NLB due to AWS restriction
/kind bug
What steps did you take and what happened: We are using NLB load balancers for our control plane endpoints, and we recently had a case where a customer scaled down their control plane from three nodes in three separate AZs to one node in one AZ. This caused an error in the CAPA bootstrap controller stating:
failed to reconcile load balancer" err=<
failed to set subnets for apiserver load balancer '***': ValidationError: Subnet removal is not supported for Network Load Balancers. You must specify all existing subnets along with any new ones
status code: 400, request id: ***
This concerns essentially the same code as issue #4357.
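For context, this restriction is enforced by AWS itself and can be reproduced with a plain ELBv2 call, independent of CAPA. Below is a minimal, untested sketch using aws-sdk-go v1; the load balancer ARN and subnet ID are placeholders, and it assumes an existing NLB that is currently attached to three subnets, with credentials and region taken from the environment:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := elbv2.New(sess)

	// The NLB is currently attached to three subnets; asking for only one of
	// them is interpreted as a subnet removal, which NLBs do not support.
	_, err := client.SetSubnets(&elbv2.SetSubnetsInput{
		LoadBalancerArn: aws.String("arn:aws:elasticloadbalancing:..."), // placeholder
		Subnets:         aws.StringSlice([]string{"subnet-0123456789abcdef0"}), // placeholder
	})
	if err != nil {
		// Expected: ValidationError: Subnet removal is not supported for
		// Network Load Balancers. You must specify all existing subnets
		// along with any new ones.
		log.Fatal(err)
	}
}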
What did you expect to happen:
The only solution, as far as I can tell, would be to delete and recreate the NLB if the number of subnets decreases.
Anything else you would like to add:
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/2d96288354dee44e60710048652267e2bf3e8c12/pkg/cloud/services/elb/loadbalancer.go#L113-L123
My suggested change would be:
// Reconcile the subnets and availability zones from the spec
// and the ones currently attached to the load balancer.
if len(lb.SubnetIDs) != len(spec.SubnetIDs) {
	if lb.LoadBalancerType == infrav1.LoadBalancerTypeNLB && len(lb.SubnetIDs) > len(spec.SubnetIDs) {
		// NLBs do not support subnet removal, so delete and recreate the
		// load balancer when the desired subnet set shrinks.
		err := s.deleteExistingNLBs()
		if err != nil {
			return errors.Wrapf(err, "failed to delete apiserver load balancer %q", lb.Name)
		}
		lb, err = s.createLB(spec)
		if err != nil {
			return errors.Wrapf(err, "failed to create apiserver load balancer %q", lb.Name)
		}
	} else {
		_, err := s.ELBV2Client.SetSubnets(&elbv2.SetSubnetsInput{
			LoadBalancerArn: &lb.ARN,
			Subnets:         aws.StringSlice(spec.SubnetIDs),
		})
		if err != nil {
			return errors.Wrapf(err, "failed to set subnets for apiserver load balancer '%s'", lb.Name)
		}
	}
}
Please keep in mind that this has not been tested yet.
Environment:
- Cluster-api-provider-aws version: > v2.1.0
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:40:17Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
  Kustomize Version: v4.5.7
  Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.10-dhc", GitCommit:"b8609d4dd75c5d6fba4a5eaa63a5507cb39a6e99", GitTreeState:"dirty", BuildDate:"2023-11-16T17:01:19Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS
Philipp Schöppner <[email protected]>, Mercedes-Benz Tech Innovation GmbH (Provider Information)
This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and providing further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
NLBs are definitely difficult to modify, so deleting and recreating the LB makes sense. I'm concerned that this will be disruptive to the cluster overall, since the LB would disappear during normal operation.
Can you talk more about situations when you'd remove existing subnets from the LB?
> scaled down their control plane from three nodes in three separate AZs to one node in one AZ.

I had skipped past this while reading; given this scenario, I think it's reasonable to expect an interruption to the LB's availability. Does that make sense to you, @schoeppi5?
Hey @nrb, thanks for your question. Yes, unavailability in this scenario is expected and accepted behaviour, so we are fine with it. I also don't think there is a lot we could do about this issue, since it is a limitation of the NLB itself.
Cool, that makes sense. And you're right, we can't do a whole lot about it since it's how NLBs work.
I do think there's a concern that someone removes a subnet and _isn't_ doing such a downscaling operation. I'd like to give this a little more thought in terms of general risk. I'll also add it to the community meeting notes to highlight it for maintainers.
Hey @nrb, I wasn't able to join the community meeting on April 8th, but I can see that you added this issue to the meeting notes. Did you get a chance to discuss this and - if so - what was the outcome?
Were you able to give this topic some more thought? Is there anything I can do to assist?
Apologies - the community meeting did not happen on April 8 - we decided to resume the normal schedule on April 15.
I haven't given it a lot more thought, but I think this is fairly low risk given the scenarios where users would change things. I'll still bring it up and double check that others don't have objections.
From the community meeting:
- Would we be able to guarantee that the NLB will get the same IP? (see the sketch below)
- We may also want to make this an opt-in feature, since it is destructive and could surprise users.
- Overall, no objections, though.
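On the question of keeping the same IP: for an internet-facing NLB, Elastic IPs can be attached via subnet mappings at creation time, which would let a recreated load balancer come back with the same public addresses. The following is a minimal, hypothetical sketch of that idea using aws-sdk-go v1, not CAPA's current code; the name, subnet ID, and EIP allocation ID are placeholders:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := elbv2.New(sess)

	// Recreate the NLB with an explicit subnet mapping. For an internet-facing
	// NLB, AllocationId pins a pre-allocated Elastic IP to the load balancer
	// node in that subnet, so the public IP survives recreation.
	out, err := client.CreateLoadBalancer(&elbv2.CreateLoadBalancerInput{
		Name:   aws.String("example-apiserver-nlb"), // placeholder
		Type:   aws.String(elbv2.LoadBalancerTypeEnumNetwork),
		Scheme: aws.String(elbv2.LoadBalancerSchemeEnumInternetFacing),
		SubnetMappings: []*elbv2.SubnetMapping{
			{
				SubnetId:     aws.String("subnet-0123456789abcdef0"),   // placeholder
				AllocationId: aws.String("eipalloc-0123456789abcdef0"), // placeholder
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("recreated NLB: %s", aws.StringValue(out.LoadBalancers[0].LoadBalancerArn))
}

An internal NLB would instead need PrivateIPv4Address in the subnet mapping to pin its address; either way, whether CAPA should manage those addresses at all is part of the opt-in discussion above.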
Any update? I'm seeing messages like these in my event logs.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten