[BUG] Internal load balancer unstable when autoscaling cluster
Describe the bug: While the cluster is autoscaling, a service exposed on the internal load balancer stops responding for a short while. It looks like the internal load balancer becomes unstable while the cluster is autoscaling.
More details:
We enabled autoscaling on our AKS cluster. Normally it runs at 2 nodes; when traffic increases it is scaled up to 3 for a short while, and then back down to 2 nodes.
We have a Redis service hosted in our cluster, exposed as a LoadBalancer service, with annotations to use an internal load balancer:
service:
  type: LoadBalancer
  ports:
    redis: 6379
  externalTrafficPolicy: Cluster
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123
This makes our Redis service available outside the cluster, inside our vnet, reachable by a different service that currently runs on a separate virtual machine.
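For context, this is roughly the Service object those values render to (the metadata name and selector are assumptions for illustration, not taken from our chart):
apiVersion: v1
kind: Service
metadata:
  name: redis   # assumed name
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
  selector:
    app: redis   # assumed selector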
When the cluster auto-scales up / down, we observe Redis timeout exceptions in our service running outside the cluster.
We observe a drop/spike in traffic to our Redis service:
[screenshot]
The Redis pod/instance is not killed when autoscaling.
There is a drop in "health probe status" on the underlying load balancer resource in the Azure portal:
[screenshot]
And a drop in data path availability:
[screenshot]
We also observe an event in our cluster when this happens: "Updated load balancer with new hosts".
Expected behavior: I expected the internal load balancer to keep working without interruption while the cluster is autoscaling.
Environment (please complete the following information):
- Kubernetes version: 1.25.5
How can I troubleshoot this further? Can I view logs from the internal load balancer? Where can I find logs?
Action required from @Azure/aks-pm
What kind of network plugin are you using? This can happen on scale down if there are connections on the node being scaled down that are being forwarded by kube-proxy to some other node; removing the node will reset them. Is that what you're seeing? It might be helpful to get a ticket going so we can take a look.
What kind of network plugin are you using?
Network type (plugin): Azure CNI
Kubernetes version: 1.25.5
This can happen on scale down if there are connections on the node being scaled down that are being forwarded by kube-proxy to some other node; removing the node will reset them. Is that what you're seeing?
I'm not sure how I can tell if this is what I'm seeing 😅
- This happens when scaling down from 3 to 2 nodes even if Redis was not on the node that was removed
- This also happens when scaling up from 2 to 3 nodes even if Redis was not touched / moved
It might be helpful to get a ticket going so we can take a look.
That would be great, however, I'll have to check with management if we can purchase a support plan first ;)
In the meantime, please let me know if there is any more information from me that could be helpful 🙏
Issue needing attention of @Azure/aks-leads
We disabled node autoscaling because of this issue. We are still very interested in a fix, as we could run on 2 nodes most of the time and 3 nodes during work hours.
There were some changes in the probe setup at the cloud-provider-azure level; check out https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#custom-load-balancer-health-probe. You might want to explicitly set the path for the probe.
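For example, something like this in the values from the original report, assuming the chart passes service annotations through (the probe settings are illustrative values taken from the linked cloud-provider-azure docs, not verified against this cluster):
service:
  type: LoadBalancer
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    # probe annotations from the linked cloud-provider-azure page; example values only
    service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "Tcp"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-interval: "5"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "2"
    # request path only applies when the probe protocol is Http/Https
    # service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"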
+1
+1
Hi @thomastvedt did you get chance to try the custom load balancer health probe as @aslafy-z suggested?
No, sorry, I didn't. We still have cluster autoscaling disabled and lock the cluster at 3 nodes even though 2 is enough most of the time. We also plan to move the service that lives outside of k8s into the cluster, which would remove our need for a stable load balancer when scaling the cluster.
If you get a chance to test it I'm interested in how it works out!
Could you please set externalTrafficPolicy to Local?
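Against the values from the original report, that change would look roughly like this (a sketch, not a verified fix; with Local, only nodes that actually run a Redis pod pass the load balancer health probe, so kube-proxy no longer forwards traffic across nodes):
service:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
    redis: 6379
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123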
This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. @chasewilson, @nilo19
Well, it's still an issue, and still a pretty bad one in my opinion.
Still an issue, but it looks like it's not prioritized?
It's hard to explain why the health probe drops when scaling up; maybe someone from the load balancer team can tell. @thomastvedt could you please try the new health probe behavior? We changed it to probe kube-proxy directly instead of the backend application. This can solve the issue and introduces no harm/behavior change for your existing workloads.
- update the azure-cli extension aks-preview
- enable the new behavior: az aks update --cluster-service-load-balancer-health-probe-mode Shared
- to disable and use the existing health probe behavior: az aks update --cluster-service-load-balancer-health-probe-mode Servicenodeport
Hi, we plan to upgrade the cluster in 1-2 months. I'll see if we can revisit this and enable autoscaling again while we're at it. We don't have the capacity to test it out before then, unfortunately 🤔
This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. Please review @chasewilson, @nilo19.
This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @thomastvedt feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.