EnsuringLoadBalancer performance issue: LoadBalancer operations are serialized in AKS
What happened: We have a problem with the performance of load-balancing rule operations when using AKS LoadBalancer services.

For context, here is what our system requires. We have a multi-region product with clusters around the globe. Customers create video-streaming routes connecting those regions, and when a route is started, pods are provisioned in those regions with video endpoints (input or output). We therefore need to create a LoadBalancer service for each endpoint, with load-balancing rules for the specific ports of the video connection. We also need the provisioning time of all components to be relatively low (around 1 minute or less, although we can live with a worst case of 2-3 minutes).

Everything works fine when launching individual routes, which create one endpoint in the source region and another in the destination region, although we noticed that creating a new LoadBalancer/Public IP sometimes takes more time than expected. To work around this we pre-reserve LoadBalancer services/Public IPs, which of course lowers the provisioning time. The real problem, though, is not the creation of the Public IPs in Azure but the configuration of the Azure Load Balancer (ALB) to redirect connections to our cluster. We realized that all operations on an ALB are serialized and take an average of 20 seconds each. On top of that, when we modify a service in AKS, even if the ports don't change (so no load-balancing rule in the ALB actually needs to change), the operation is still queued and takes the usual 20 seconds.

This is a huge problem for use cases where customers need to start many routes (endpoints) in one region at once (we have disaster-recovery use cases where all routes must be available within 2-3 minutes), because all the operations are serialized and the last ones take a long time to be ready. If a customer provisions 50 endpoints, that makes 50 * 20 seconds = 1000 seconds, or almost 17 minutes! It gets worse when other customers also have pending operations in the same region at the time the load-balancing rules are requested from the ALB; the total provisioning time can then reach 30 minutes or even hours.

We don't know whether the operations are throttled or serialized for some reason, or whether there is any way to parallelize them (or to group them so the LoadBalancer can be configured with all the rules in one go). If not, that makes AKS unusable for our use cases.
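For reference, here is a minimal sketch of the kind of per-endpoint LoadBalancer service we create (the service name is taken from the events below, but the selector and port numbers are placeholders, not our real values):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: public-ip-d10            # per-endpoint service, name as in the events below
spec:
  type: LoadBalancer
  selector:
    app: video-endpoint-d10      # placeholder pod label
  ports:
    - name: video-udp
      protocol: UDP
      port: 5004                 # placeholder video port
      targetPort: 5004
    - name: control-tcp
      protocol: TCP
      port: 6000                 # placeholder control port
      targetPort: 6000
EOF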
What you expected to happen: We expect operations on the LoadBalancer to be parallelized or grouped, and the provisioning time reduced accordingly.
How to reproduce it (as minimally and precisely as possible): The serialization of LoadBalancer operations can be seen by looking at the Events in Kubernetes. No operation is started until the previous one has finished:
kubectl get events | grep service
31s Normal EnsuredLoadBalancer service/public-ip-d10 Ensured load balancer
53s Normal EnsuringLoadBalancer service/public-ip-d10 Ensuring load balancer
53s Normal EnsuredLoadBalancer service/public-ip-b11 Ensured load balancer
77s Normal EnsuringLoadBalancer service/public-ip-b11 Ensuring load balancer
77s Normal EnsuredLoadBalancer service/public-ip-2c1 Ensured load balancer
101s Normal EnsuringLoadBalancer service/public-ip-2c1 Ensuring load balancer
101s Normal EnsuredLoadBalancer service/public-ip-cb7 Ensured load balancer
2m8s Normal EnsuringLoadBalancer service/public-ip-cb7 Ensuring load balancer
2m8s Normal EnsuredLoadBalancer service/public-ip-6e2 Ensured load balancer
2m34s Normal EnsuringLoadBalancer service/public-ip-6e2 Ensuring load balancer
2m34s Normal EnsuredLoadBalancer service/public-ip-d54 Ensured load balancer
2m44s Normal EnsuringLoadBalancer service/public-ip-d54 Ensuring load balancer
In this example, 6 LoadBalancer services were changed at the same time, and there is a difference of 2m13s between the start of the first service operation and the completion of the last one. We also verified that the connection to service/public-ip-d10 does not work until its "Ensured load balancer" event is received.
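To reproduce, a rough sketch (the image, names, and loop count below are arbitrary placeholders) is to create several LoadBalancer services at roughly the same time and watch the events arrive one by one:

# Create several LoadBalancer services at roughly the same time
for i in $(seq 1 6); do
  kubectl create deployment "stream-$i" --image=nginx
  kubectl expose deployment "stream-$i" --name="public-ip-$i" \
    --type=LoadBalancer --port=80 --target-port=80
done

# Watch the EnsuringLoadBalancer / EnsuredLoadBalancer events trickle in sequentially
kubectl get events --watch | grep -E 'Ensur(ing|ed)LoadBalancer'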
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"41d24ec9c736cf0bdb0de3549d30c676e98eebaf", GitTreeState:"clean", BuildDate:"2021-01-18T09:12:27Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- Size of cluster: Two node pools of 2 nodes each, with autoscaling, although we tested on different cluster sizes.
- General description of workloads in the cluster: Dynamic pods with public UDP/TCP endpoints for stream processing and routing.
Hi kortatu, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
- If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
- Please abide by the AKS repo Guidelines and Code of Conduct.
- If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
- Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
- Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
- If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
Triage required from @Azure/aks-pm
Action required from @Azure/aks-pm
The LB operations are actually sent in parallel, but they can only be executed sequentially by the LB. We are working with the Azure LB team to improve this.
As @palma21 said, we are still working on improving this with the ALB team. However, if you're only targeting a single pod behind each port, you could consider following more of a game-server operational model: put public IPs on all of your nodes, then expose your services as NodePort type and hand the public IP and port back. This allows you to expose your services publicly with no waiting on the ALB.
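To illustrate the suggestion above, here is a minimal sketch of such a NodePort service (the name, label, and port numbers are placeholders; assuming public IPs are already attached to the nodes, the endpoint handed back would be the node's public IP plus the nodePort, with no ALB rule involved):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: video-endpoint-nodeport   # placeholder name
spec:
  type: NodePort
  selector:
    app: video-endpoint           # placeholder pod label
  ports:
    - name: video-udp
      protocol: UDP
      port: 5004                  # placeholder port
      targetPort: 5004
      nodePort: 30504             # placeholder port in the default 30000-32767 NodePort range
EOF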
Issue needing attention of @Azure/aks-leads