
[BUG] Internal load balancer unstable when autoscaling cluster

Open thomastvedt opened this issue 2 years ago • 16 comments

Describe the bug When the cluster is autoscaling, a service exposed on the internal load balancer stops responding for a short while. The internal load balancer appears to become unstable while the cluster is autoscaling.

More details:
We enabled autoscaling on our AKS cluster. It normally runs at 2 nodes; when traffic increases it scales up to 3 for a short while, and then back down to 2 nodes.

We have a redis service hosted in our cluster, exposed as a LoadBalancer service, using annotations to use an internal load balancer:

  service:
    type: LoadBalancer
    ports:
      redis: 6379
    externalTrafficPolicy: Cluster
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123

This makes our Redis service available outside the cluster, inside our vnet, reachable by a different service that currently runs on a separate virtual machine.

When the cluster auto-scales up / down, we observe Redis timeout exceptions in our service running outside the cluster.

We observe a drop/spike in traffic to our Redis service (see attached screenshot).

The Redis pod/instance is not killed when autoscaling.

There is a drop in "health probe status" on the underlying load balancer resource in the Azure portal (see attached screenshot).

There is also a drop in data path availability (see attached screenshot).

We also observe an event in our cluster when this happens: "Updated load balancer with new hosts".

Expected behavior I expected the internal load balancer to keep working reliably while the cluster is autoscaling.

Environment (please complete the following information):

  • Kubernetes version: 1.25.5

How can I troubleshoot this further? Can I view logs from the internal load balancer? Where can I find logs?

thomastvedt avatar May 03 '23 09:05 thomastvedt

Action required from @Azure/aks-pm

ghost avatar Jun 02 '23 16:06 ghost

What kind of network plugin are you using? This can happen on scale down: if there are connections on the node being scaled down that kube-proxy is forwarding to another node, those connections will be reset. Is that what you're seeing? It might be helpful to get a support ticket going so we can take a look.

palma21 avatar Jun 02 '23 18:06 palma21

What kind of network plugin are you using?

Network type (plugin): Azure CNI
Kubernetes version: 1.25.5

This can happen on scale down: if there are connections on the node being scaled down that kube-proxy is forwarding to another node, those connections will be reset. Is that what you're seeing?

I'm not sure how I can tell if this is what I'm seeing 😅

  • This happens when scaling down from 3 to 2 nodes even if Redis was not on the node that was removed
  • This also happens when scaling up from 2 to 3 nodes even if Redis was not touched / moved

It might be helpful to get a support ticket going so we can take a look.

That would be great, however, I'll have to check with management if we can purchase a support plan first ;)


In the mean time, please let me know if there is any more information from me that could be helpful 🙏

thomastvedt avatar Jun 05 '23 13:06 thomastvedt

Action required from @Azure/aks-pm

ghost avatar Jun 10 '23 16:06 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Jun 25 '23 18:06 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Jul 11 '23 00:07 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Jul 26 '23 06:07 ghost

We disabled node autoscaling because of this issue. We are still very interested in a fix, as we could run on 2 nodes most of the time and 3 nodes during work hours.

thomastvedt avatar Feb 19 '24 15:02 thomastvedt

There were some changes to the probe setup at the cloud-provider-azure level; check out https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#custom-load-balancer-health-probe. You might want to specify the health probe protocol/path explicitly.
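
A minimal sketch of what that could look like, reusing the service values from the original report. This assumes the chart passes annotations through to the Service; the annotation names come from the linked cloud-provider-azure docs, and the interval/retry values are purely illustrative:

  service:
    type: LoadBalancer
    ports:
      redis: 6379
    externalTrafficPolicy: Cluster
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123
      # Probe port 6379 over plain TCP instead of relying on the default probe setup
      service.beta.kubernetes.io/port_6379_health-probe_protocol: "Tcp"
      # Tighter probe interval and retry count so unhealthy nodes drop out of rotation faster
      service.beta.kubernetes.io/azure-load-balancer-health-probe-interval: "5"
      service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "2"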

aslafy-z avatar Mar 06 '24 08:03 aslafy-z

+1

Yagya-omnius avatar Aug 21 '24 14:08 Yagya-omnius

+1

Jeishan-omnius avatar Aug 22 '24 08:08 Jeishan-omnius

Hi @thomastvedt did you get chance to try the custom load balancer health probe as @aslafy-z suggested?

AllenWen-at-Azure avatar Aug 29 '24 16:08 AllenWen-at-Azure

Hi @thomastvedt did you get chance to try the custom load balancer health probe as @aslafy-z suggested?

No, sorry, I didn't. We still have cluster autoscaling disabled and lock the cluster at 3 nodes, even though 2 is enough most of the time. We also plan to move the service that lives outside of k8s into the cluster, which would remove our need for a stable load balancer when scaling the cluster.

If you get a chance to test it I'm interested in how it works out!

thomastvedt avatar Sep 02 '24 09:09 thomastvedt

Could you please set externalTrafficPolicy to Local?
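
For reference, a minimal sketch of that change against the service values from the original report (assuming the chart exposes externalTrafficPolicy directly). With Local, only nodes that actually host a Redis pod answer the load balancer health probe, so nodes added or removed by the autoscaler fall out of the backend pool instead of forwarding (and potentially resetting) connections:

  service:
    type: LoadBalancer
    ports:
      redis: 6379
    # Local keeps traffic on nodes that run a Redis pod; other nodes fail the
    # LB health probe and are taken out of rotation
    externalTrafficPolicy: Local
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123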

MartinForReal avatar Sep 04 '24 16:09 MartinForReal

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. @chasewilson, @nilo19

Well, it's still an issue, and still a pretty bad one in my opinion.

thomastvedt avatar Feb 19 '25 22:02 thomastvedt

Still an issue, but it looks like it's not being prioritized?

thomastvedt avatar Mar 26 '25 21:03 thomastvedt

It's hard to explain why the health probe drops when scaling up; maybe someone from the load balancer team can tell. @thomastvedt could you please try the new health probe behavior? We changed it to probe kube-proxy directly instead of the backend application. This can solve the issue and introduces no harm/behavior change for your existing workloads.

  1. Update the azure-cli extension aks-preview.
  2. Run az aks update --cluster-service-load-balancer-health-probe-mode Shared. To disable it and return to the existing health probe behavior, run az aks update --cluster-service-load-balancer-health-probe-mode Servicenodeport.

nilo19 avatar Apr 28 '25 00:04 nilo19

It's hard to explain why the health probe drops when scaling up; maybe someone from the load balancer team can tell. @thomastvedt could you please try the new health probe behavior? We changed it to probe kube-proxy directly instead of the backend application. This can solve the issue and introduces no harm/behavior change for your existing workloads.

  1. Update the azure-cli extension aks-preview.
  2. Run az aks update --cluster-service-load-balancer-health-probe-mode Shared. To disable it and return to the existing health probe behavior, run az aks update --cluster-service-load-balancer-health-probe-mode Servicenodeport.

Hi, we plan to upgrade the cluster in 1-2 months. I'll see if we can revisit this and enable autoscaling again while we're at it. We don't have the capacity to test it out before then, unfortunately 🤔

thomastvedt avatar Apr 28 '25 21:04 thomastvedt

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. Please review @chasewilson, @nilo19.

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @thomastvedt feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.