cloud-provider-azure

Azure instances can become wedged when they are unready and need outbound connections to become ready

Open JoelSpeed opened this issue 1 year ago • 8 comments

What happened:

After upgrading to K8s 1.26, we have noticed that our Azure master instances can become stuck when booting (in-tree) or when upgrading. Specifically, if they need an outbound connection to become ready (e.g. pulling a CNI-related image), they cannot pull that image.

This is a result of the architecture we are using (a load balancer for outbound connectivity while also being connected to an internal load balancer); however, I don't think it's ideal that the Azure CCM is knowingly able to break users' clusters.

Background

The service controller has historically removed instances from load balancers when it knows that they are unready, the theory being that an unready instance won't be able to serve traffic (I'm not convinced that's actually true, but that's a side point). This behaviour had been broken in the past and was fixed again (https://github.com/kubernetes-sigs/cloud-provider-azure/pull/1195) in Kube 1.25.

When you are running in Azure, your outbound connectivity is configured in one of a few different ways; we've been using scenario 1 from this document.

When the instance is connected to the public load balancer, it uses this to route traffic out of the internal network and connect to the internet. When you remove it, as the service controller does when the instance is unready, it falls back to the default outbound connectivity, so in general you don't lose internet connectivity.

However, if your instance is also connected to an internal load balancer, it must either be connected to a public load balancer, have a public IP, or have a NAT gateway available; it cannot use the default outbound connection. This means that when the service controller removes a control plane instance (connected to both internal and external LBs) from the public LB, the instance loses connectivity to the internet.

If the instance then needs outbound connectivity to become ready again (e.g. pulling an image), it breaks, because it can no longer pull.

There are recommendations to change the architecture; for example, if we migrated to a NAT gateway, this wouldn't matter. However, NAT gateways are much more expensive than the current solution, and you have to know about the issue to be able to mitigate it. Given the severity here (a control plane instance losing its connection and being unable to get back to ready), I would expect this to be a very stressful situation for a user if they realise it mid-incident.

IMO we should prevent this from happening, rather than relying on users either knowing about this, or finding it while it's already broken.

Until 1.26, this happened infrequently. Now, with https://github.com/kubernetes/kubernetes/pull/111663, we are seeing this more often, as the service controller is quicker to react to instances going unready.

Why are we removing instances from the load balancer?

I think we are removing instances from the load balancer as an optimisation: we know that the instance is unready, so it is "likely" that it cannot serve traffic (which is debatable). But I think the main reason is that when you use a service with externalTrafficPolicy: Cluster, the health probes configured by Azure currently don't allow you to shut down gracefully (see #3499), meaning that instances are removed from service between 5 and 10 seconds after they stop serving traffic. In theory, if the instance goes unready before this, removing it from the load balancer saves you from some of that disruption.
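
For context, the "unready" signal here is the Node's Ready condition. A minimal sketch of that kind of readiness check, using the standard Kubernetes API types (this is illustrative only, not the actual service controller code), could look like:

```go
import v1 "k8s.io/api/core/v1"

// isNodeReady reports whether a Node's Ready condition is True. This is a
// sketch of the readiness signal the service controller keys off when
// deciding to drop an instance from the load balancer pool.
func isNodeReady(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}
```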

What you expected to happen:

The Azure CCM should not remove outbound connectivity from an instance.

How to reproduce it (as minimally and precisely as possible):

  • Create an Azure cluster using a load balancer for outbound connectivity (default on AKS)
  • Simulate a scenario where you need the internet to start kubelet or get a node to ready
    • This could be requiring a fresh image for the CNI on boot (imagePullPolicy: Always)
    • Add a systemd drop-in that must start before kubelet and that downloads some document from the internet
  • Restart the node
    • It should at some point go unready, at which point it will become disconnected from the load balancer by the service controller
    • At this point, it is no longer able to reach the internet, and the above measures prevent it from becoming ready again

Anything else we need to know?:

I would like to propose a patch to the CCM code that checks whether an instance would lose internet connectivity if it were removed from the load balancer, and prevents it from being removed if so.

The decision tree would be

graph TD;
A{Is Node ready?};
A--Yes-->B[Node should be included in LB];
A--No-->C{Is instance attached to internal LB?};
C--No-->D[Node should be removed from LB];
C--Yes-->E{Does instance have a public LB?};
E--Yes-->D;
E--No-->F{Does instance network have a NAT gateway?};
F--Yes-->D;
F--No-->G[Instance would lose internet connectivity];
G-->B;
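
A minimal sketch of how this check could look in Go, assuming a hypothetical nodeOutboundInfo struct whose fields would be populated from the VM, NIC and subnet state the cloud provider already tracks (the names here are illustrative, not existing cloud-provider-azure APIs), and using the isNodeReady helper sketched earlier:

```go
// nodeOutboundInfo captures the per-instance facts the decision tree needs.
// All fields are hypothetical; a real patch would derive them from the load
// balancer backend pools, public IP configurations and subnet NAT gateway
// associations known to the cloud provider.
type nodeOutboundInfo struct {
	attachedToInternalLB bool
	hasPublicLBOrIP      bool
	hasNATGateway        bool
}

// shouldKeepNodeInLB implements the decision tree above: keep the node in
// the load balancer when it is ready, or when removing it would leave the
// instance with no outbound (internet) path at all.
func shouldKeepNodeInLB(node *v1.Node, info nodeOutboundInfo) bool {
	if isNodeReady(node) {
		return true // ready nodes are always included
	}
	if !info.attachedToInternalLB {
		return false // default outbound connectivity still applies; safe to remove
	}
	if info.hasPublicLBOrIP || info.hasNATGateway {
		return false // another outbound path exists; safe to remove
	}
	// Unready, attached only to an internal LB, and no other outbound path:
	// removing it would cut internet connectivity, so keep it in the pool.
	return true
}
```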

With this in place, and without any changes to the health probes, we are seeing 8-15 seconds of disruption for externalTrafficPolicy: Cluster services, instead of 3-6 seconds. externalTrafficPolicy: Local seems to be unaffected.

IMO I would rather be disrupted for a few seconds than have my instance wedge because it has no internet connectivity.

Environment:

  • Kubernetes version (use kubectl version): 1.26.2
  • Cloud provider or hardware configuration: Azure with load balancers being used for outbound connectivity
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools: OpenShift
  • Network plugin and version (if this is a network-related bug): OVN-Kubernetes
  • Others:

JoelSpeed avatar Mar 06 '23 14:03 JoelSpeed

Thanks for reporting the issue. As documented at https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#standard-loadbalancer, when using SLB for outbound connectivity, it is recommended to define the outbound rules with a separate outbound backend address pool. That way, the cloud provider would not touch the outbound pool, while still being able to remove unready nodes from the inbound pool.

Besides SLB, using a NAT Gateway or UDR (e.g. a firewall) should also work (refer to the AKS docs here for details).

feiskyer avatar Mar 07 '23 23:03 feiskyer

Hey @feiskyer, thanks for the links to the various architecture-based solutions for this issue. I can see how they work and why they would resolve the issue, and we will certainly look into upgrading our existing clusters, but in the meantime I'm concerned about folks who haven't hit this yet.

We don't know how many clusters exist today that only have a single public LB backend pool providing them with outbound connectivity. This bug can be catastrophic and cause a complete cluster outage (we have seen this in some recent testing with a customer).

I wonder whether, to protect people who would otherwise only find this issue during an incident, we should implement a code solution that prevents instances from losing internet connectivity, and possibly log warnings explaining that users should upgrade their architecture.

As far as I know, there aren't other providers where outbound connectivity can become tied to the Kubernetes service in the same way, so this would be surprising behaviour to anyone who comes to this from another platform.

JoelSpeed avatar Mar 09 '23 12:03 JoelSpeed

@JoelSpeed fully agreed with the concerns. In case you didn't notice it yet, there was a discussion a long time ago at https://github.com/kubernetes/kubernetes/pull/72813, and our conclusion at that time was:

as we discussed offline. The cloud provider should not create the outbound rule (even if it does not exist). The cluster bootstrap operation should take care of that. This enables users to choose how they want to configure egress path.

Since then, the outbound configurations have been set from provisioning tools (e.g. AKS, capz or aks-engine).

feiskyer avatar Mar 10 '23 07:03 feiskyer

Thanks for the link, I hadn't seen that issue, and I agree: I don't think the outbound rules should be added by the cloud provider. But that doesn't necessarily mean we can't fix this issue.

Do you have a good understanding of why we remove instances from the load balancers? I'm not sure we should be doing so. I wrote this originally:

I think we are removing instances from the load balancer as an optimisation: we know that the instance is unready, so it is "likely" that it cannot serve traffic (which is debatable). But I think the main reason is that when you use a service with externalTrafficPolicy: Cluster, the health probes configured by Azure currently don't allow you to shut down gracefully (see https://github.com/kubernetes-sigs/cloud-provider-azure/issues/3499), meaning that instances are removed from service between 5 and 10 seconds after they stop serving traffic. In theory, if the instance goes unready before this, removing it from the load balancer saves you from some of that disruption.

I think if we were to make sure that the load balancer health probes worked, we wouldn't need to be removing instances from the load balancer, and this issue would also be resolved.

I'm not sure why an unready node would ever be removed from the load balancer anyway. When using a service with ETP Local, the service controller only calls update load balancer when a node is either added or removed from the cluster. If you have a service with ETP Cluster, it updates the load balancer whenever there is a change from ready to unready on a node, allowing you to take action upon that event. However, does the readiness of the node really dictate whether it can serve traffic or not? There are many reasons why an instance would go unready, disk pressure or memory pressure for example, in those scenarios, it would probably still be able to serve traffic. It strikes me that the real test of whether an instance can serve traffic, for an ETP Cluster service, should be whether it can route traffic to the available backends or not, which would be based on the health of kube-proxy normally.
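
If the question is really "can this node still route Service traffic?", kube-proxy's own health endpoint (served on port 10256 by default) is arguably the more relevant signal than the Node Ready condition. A rough sketch of probing it from Go follows; this is a thought experiment under the stated assumptions, not something the CCM does today, and the timeout and "non-200 means unhealthy" policy are choices of the sketch:

```go
import (
	"fmt"
	"net/http"
	"time"
)

// canRouteServiceTraffic probes kube-proxy's healthz endpoint as a proxy for
// "can this node still forward traffic for ETP Cluster services". Port 10256
// is kube-proxy's default healthz port; any error or non-200 response is
// treated as unhealthy in this sketch.
func canRouteServiceTraffic(nodeIP string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:10256/healthz", nodeIP))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```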

I've written more on this in #3499 but I'm wondering, is there a good reason to remove unready nodes, other than as an optimisation for health probes being misconfigured?

JoelSpeed avatar Mar 10 '23 09:03 JoelSpeed

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 08 '23 09:06 k8s-triage-robot

/remove-lifecycle stale

JoelSpeed avatar Jun 12 '23 10:06 JoelSpeed

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 22 '24 08:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 21 '24 08:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 22 '24 09:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 22 '24 09:03 k8s-ci-robot