
NLB support

Open carlosonunez opened this issue 2 years ago • 16 comments

/kind feature

Describe the solution you'd like:

As an engineer looking to deploy Kubernetes with CAPA, I would like my control plane to be fronted by a Network Load Balancer instead of Classic ELBs, so that I can continue to use CAPA after AWS deprecates them.

Anything else you would like to add:

Classic ELBs are getting deprecated in August 2022, and the AWS SDK for Go supports NLBs. We should use them!

More info here
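
For context, NLBs are created through the ELBv2 API that the AWS SDK for Go already exposes. A minimal sketch with SDK v1 (the load balancer name and subnet IDs are placeholders, not CAPA code):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))

	// An NLB is just an ELBv2 load balancer with Type set to "network".
	out, err := svc.CreateLoadBalancer(&elbv2.CreateLoadBalancerInput{
		Name:    aws.String("example-apiserver"),                         // placeholder
		Type:    aws.String(elbv2.LoadBalancerTypeEnumNetwork),           // "network"
		Scheme:  aws.String(elbv2.LoadBalancerSchemeEnumInternetFacing),  // "internet-facing"
		Subnets: aws.StringSlice([]string{"subnet-aaaa", "subnet-bbbb"}), // placeholders
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(aws.StringValue(out.LoadBalancers[0].DNSName))
}
```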

carlosonunez avatar Jan 20 '22 21:01 carlosonunez

@carlosonunez: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 20 '22 21:01 k8s-ci-robot

Thanks for bringing this up @carlosonunez I think we are tracking this somewhere else, will find out and link these.

There is ongoing testing to see how to deal with the hairpinning issue that comes with NLBs (when the LB directs kubelet traffic to the api-server on the same node).

sedefsavas avatar Jan 20 '22 21:01 sedefsavas

That's great that this is being tracked elsewhere! I couldn't find any issues pertaining to it and thought "That can't be right!" :)

carlosonunez avatar Jan 20 '22 21:01 carlosonunez

Any update on whether this truly is being tracked? Thanks!

cbusch-pivotal avatar Jan 24 '22 17:01 cbusch-pivotal

I haven't seen an issue tracking NLB support, but we considered using NLB back in 2018: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/115#discussion_r220562864

dlipovetsky avatar Jan 24 '22 18:01 dlipovetsky

There is ongoing testing to see how to deal with the hairpinning issue that comes with NLBs (when the LB directs kubelet traffic to the api-server on the same node).

For reference, CAPZ also has a hairpinning issue, although specifically when using an internal load balancer: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/1033#issuecomment-1007789608. The current workaround is to prevent requests from kubelet on the control plane machines from going through the LB by resolving the LB's DNS record to 127.0.0.1.
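
For illustration only (not CAPZ's or CAPA's actual code; the LB DNS name is a placeholder), the core of that hosts-file workaround in Go would be a one-line append on each control plane machine:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	const lbDNSName = "example-lb.elb.us-east-1.amazonaws.com" // placeholder

	// Map the LB's DNS name to the loopback address so kubelet traffic to the
	// apiserver never leaves the node. Must run as root on the machine itself;
	// in practice this would be done via a bootstrap script, not a Go binary.
	f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := fmt.Fprintf(f, "127.0.0.1 %s\n", lbDNSName); err != nil {
		log.Fatal(err)
	}
}
```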

dlipovetsky avatar Jan 24 '22 19:01 dlipovetsky

I could not find a specific issue for this. I see it was discussed in the CAPA v1alpha4 API changes document, but during today's office hours we triaged this issue and are keeping it open.

sedefsavas avatar Jan 24 '22 22:01 sedefsavas

For reference, CAPZ also has a hairpinning issue, although specifically when using an internal load balancer: kubernetes-sigs/cluster-api-provider-azure#1033 (comment). The current workaround is to prevent requests from kubelet on the control plane machines from going through the LB by resolving the LB's DNS record to 127.0.0.1.

Thanks @dlipovetsky, good to know there is an example workaround. If I can't make it work by disabling client IP preservation, this might be the way to go.

sedefsavas avatar Jan 24 '22 22:01 sedefsavas

As suggested by @sedefsavas, changing the target group type from instance to IP can also work around this issue: https://aws.amazon.com/premiumsupport/knowledge-center/target-connection-fails-load-balancer/. The downside is that the instance's IP might change during its lifecycle; the corresponding target group needs to be updated, and the NLB's listeners need to be updated as well. This could cause downtime.

Update: With an instance-type target, I can still curl the NLB endpoint from within the control plane, given an appropriate SecurityGroup.
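
A minimal sketch of registering an instance-ID target with the AWS SDK for Go v1 (the target group ARN and instance ID are placeholders); this is the mode where the registration survives a private IP change, since the LB resolves the instance's current IP itself:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))

	// With an "instance" target group, targets are registered by EC2 instance
	// ID rather than by IP, so no re-registration is needed if the IP changes.
	_, err := svc.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String("arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123"), // placeholder
		Targets: []*elbv2.TargetDescription{
			{Id: aws.String("i-0123456789abcdef0")}, // placeholder instance ID
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```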

lubronzhan avatar Jan 25 '22 22:01 lubronzhan

Just a correction on the retirement of Classic Load Balancers mentioned in the issue:

EC2-Classic networking is being retired, not Classic Load Balancers. Classic Load Balancers that are being used in EC2-Classic networking will need to be migrated to an LB in a VPC. Classic Load Balancers that are already in a VPC won't be affected.

TL;DR: the August 2022 retirement date is not for Classic ELBs.

sedefsavas avatar Feb 17 '22 07:02 sedefsavas

UPDATE: Never mind. This deprecation only applies to ELBs in EC2-Classic, as @sedefsavas said. It's a bit confusing since this article recommends migrating to ELBv2 resources.

The language is SUPER confusing, but according to this AWS article, ELBs (aka Classic LBs) are in scope for retirement as well: https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/migrate-classic-load-balancer.html#migrate-step-by-step-classiclink


carlosonunez avatar Feb 17 '22 15:02 carlosonunez

This feature (NLB for the control plane) is very much required in the market, irrespective of CLBs still being around. Please let me know if there is a need for a contribution to this.

anmolsachan avatar Mar 01 '22 10:03 anmolsachan

Are we planning to convert the existing Classic ELBs to NLBs?

vincepri avatar Apr 11 '22 17:04 vincepri

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 10 '22 18:07 k8s-triage-robot

/remove-lifecycle stale

voor avatar Jul 10 '22 18:07 voor

While implementing this, we can avoid introducing the problem of creating LBs with identical names for clusters created with the same name in different namespaces: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/969

NLB names need to be unique in the same AWS account and region.
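
One illustrative way to keep names unique across namespaces while respecting the 32-character ELBv2 name limit is to append a short hash of the namespaced cluster name. This is a sketch, not CAPA's actual naming scheme:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// nlbName derives a name that is unique per namespace/name pair and stays
// within ELBv2's 32-character limit: a readable prefix plus an 8-char hash.
func nlbName(namespace, name string) string {
	sum := sha256.Sum256([]byte(namespace + "/" + name))
	suffix := hex.EncodeToString(sum[:])[:8]

	prefix := name
	if len(prefix) > 23 { // 23 + "-" + 8 = 32 chars max
		prefix = prefix[:23]
	}
	return fmt.Sprintf("%s-%s", prefix, suffix)
}

func main() {
	fmt.Println(nlbName("team-a", "prod")) // same cluster name, different namespaces...
	fmt.Println(nlbName("team-b", "prod")) // ...yields different NLB names
}
```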

sedefsavas avatar Jul 28 '22 18:07 sedefsavas

/assign

Skarlso avatar Oct 19 '22 09:10 Skarlso

Hello all.

So I read through this issue finally. :)

What is the consensus here? Do we just disable IP address preservation, or do we do a workaround similar to what Azure did?

@sedefsavas do you have anything on this by any chance? It's okay if not. I'm just gathering as much information as I can. :)

Skarlso avatar Oct 20 '22 05:10 Skarlso

This comment:

The downside is that the instance's IP might change during its lifecycle; the corresponding target group needs to be updated, and the NLB's listeners need to be updated as well. This could cause downtime.

I'm not sure how much of a downside this is. Are you implying that when an EC2 instance gets rebooted it might get assigned a new IP address? That seems like an edge case that might be worth further analysis. However, when new machines are rolled out they need to be added to the target groups anyway -- regardless of ELB or NLB -- so I don't consider that scenario a "new problem" with NLB specifically.

voor avatar Oct 20 '22 13:10 voor

If the target type for each target group in the NLB is by instance-id, then this seems like a non-issue?

carlosonunez-vmw avatar Oct 20 '22 18:10 carlosonunez-vmw

by instance-id

It looks like Amazon has provided better clarity on this: if you turn off "Preserve client IP addresses", you can have loopback without using IP addresses explicitly. However, this would mean that the API server always sees the NLB's IP address, even for external users, which does not sound good from an audit standpoint.

Sources: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html Also: https://aws.amazon.com/premiumsupport/knowledge-center/target-connection-fails-load-balancer/
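
For reference, client IP preservation is a target group attribute in the ELBv2 API. A sketch of disabling it with the AWS SDK for Go v1 (the target group ARN is a placeholder):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))

	// Turn off client IP preservation on the NLB's target group, per the AWS
	// troubleshooting docs linked above. Targets then see the NLB's IP as the
	// source address, which avoids the hairpinning failure.
	_, err := svc.ModifyTargetGroupAttributes(&elbv2.ModifyTargetGroupAttributesInput{
		TargetGroupArn: aws.String("arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123"), // placeholder
		Attributes: []*elbv2.TargetGroupAttribute{
			{
				Key:   aws.String("preserve_client_ip.enabled"),
				Value: aws.String("false"),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```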

voor avatar Oct 20 '22 19:10 voor

Yeah, we probably will have to do something like what CAPZ did.

Skarlso avatar Oct 21 '22 07:10 Skarlso

Okay, so last call: we are going to do something like CAPZ did, where they edited the hosts file to map the LB's IP to localhost. All in favour say yay; all against say nay.

Skarlso avatar Oct 25 '22 04:10 Skarlso

yay

carlosonunez avatar Oct 25 '22 05:10 carlosonunez

Okay, so last call: we are going to do something like CAPZ did, where they edited the hosts file to map the LB's IP to localhost. All in favour say yay; all against say nay.

Just a naive question: does this affect the kubelet certificate in any way? E.g., does it also have to contain 127.0.0.1 instead of only the node IP? Furthermore, would it make sense to make this behavior (mapping the LB IPs to localhost in /etc/hosts) configurable? There might be cases where the hairpinning issue is not present and it's desirable that the traffic goes through the API LB instead of to the single API server running on the local node.

johannesfrey avatar Oct 25 '22 06:10 johannesfrey

Just a naive question: does this affect the kubelet certificate in any way? E.g., does it also have to contain 127.0.0.1 instead of only the node IP?

Good question. I don't know. :) I will have to consult some documentation. I would assume so; otherwise, the call to localhost might fail with a certificate issue. But also, the call to localhost does not usually require a certificate.

Skarlso avatar Oct 25 '22 07:10 Skarlso

Furthermore, would it make sense to make this behavior (mapping the LB IPs to localhost in /etc/hosts) configurable? There might be cases where the hairpinning issue is not present and it's desirable that the traffic goes through the API LB instead of to the single API server running on the local node.

Certainly. :) That was my intention.

Skarlso avatar Oct 25 '22 07:10 Skarlso

Yeah, since we will add 127.0.0.1, which is the loopback address, I don't think it needs to be added to the certificate. But let's make sure. :)
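
One quick empirical check (a rough sketch; the endpoint and port are assumptions) is to dial the apiserver and print the SANs on its serving certificate:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"log"
)

func main() {
	// InsecureSkipVerify because we only want to inspect the certificate,
	// not trust it. Run from a control plane node; 6443 is the usual port.
	conn, err := tls.Dial("tcp", "127.0.0.1:6443", &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	cert := conn.ConnectionState().PeerCertificates[0]
	fmt.Println("DNS SANs:", cert.DNSNames)
	fmt.Println("IP SANs: ", cert.IPAddresses)
}
```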

Skarlso avatar Oct 25 '22 07:10 Skarlso

I started to flesh this out. I'm going to try and keep it super simple at first. Let's see how that goes.

Skarlso avatar Oct 29 '22 16:10 Skarlso

I started with the basics of just creating the NLB. There are still a couple of TODOs left on it.

Skarlso avatar Oct 31 '22 10:10 Skarlso