Inference traffic load balancing
Thank you for your valuable work! I would like to know what the design idea behind the inference traffic load balancing is, and where the code lives.
@justinSmileDate Thanks for the interest.
By default we take advantage of the k8s Service to distribute the inference traffic. Additionally, we have a more elaborate design with support for different ingress providers.
Please take a look at this for more detail: https://github.com/sgl-project/ome/tree/main/pkg/controller/v1beta1/inferenceservice/reconcilers/ingress
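For illustration, here is a minimal Go sketch of the kind of Service object that fronts the inference pods. This is not the project's actual reconciler code, and the names (`my-inference-service`, the labels, the ports) are made up; it just shows the mechanism: every ready pod matching the selector becomes a Service endpoint, and kube-proxy spreads connections across those endpoints (per-connection random in iptables mode, true round robin available in IPVS mode). The ingress reconcilers in the linked package build vendor-specific ingress resources on top of such a Service.

```go
// Sketch only: the Service shape that distributes inference traffic.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "my-inference-service", // illustrative name, not from the repo
			Namespace: "default",
		},
		Spec: corev1.ServiceSpec{
			// All pods carrying these labels become endpoints of the Service;
			// kube-proxy load-balances incoming connections across them.
			Selector: map[string]string{"app": "my-inference-service"},
			Ports: []corev1.ServicePort{{
				Port:       80,
				TargetPort: intstr.FromInt(8080), // the inference container's port
			}},
		},
	}
	fmt.Printf("%+v\n", svc.Spec)
}
```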
Thank you for your professional reply! What I actually want to understand is how `opts` is generated. I would expect the generation of `opts` to include the traffic-distribution strategy, but I don't see any traffic-distribution code in the project. Can you tell me where it is?
The `opts` code is here: https://github.com/sgl-project/ome/blob/main/pkg/controller/v1beta1/inferenceservice/reconcilers/ingress/reconciler.go#L90
Depending on the vendor, out of the box it uses k8s round robin; in our runtimes we use the SGLang router for better load balancing, as well as for PD (prefill/decode) load balancing.
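To make "round robin" concrete, here is a small Go illustration of the selection policy. Note that kube-proxy implements this with iptables/IPVS rules in the dataplane, not with application code like this, and the SGLang router applies its own, smarter policies on top; this sketch only shows what rotating across pod endpoints means.

```go
// Illustration only: round-robin selection over a fixed endpoint list.
package main

import (
	"fmt"
	"sync/atomic"
)

type roundRobin struct {
	endpoints []string
	next      uint64
}

// pick returns the next endpoint in rotation; safe for concurrent callers.
func (r *roundRobin) pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.endpoints[(n-1)%uint64(len(r.endpoints))]
}

func main() {
	rr := &roundRobin{endpoints: []string{"pod-a:8080", "pod-b:8080", "pod-c:8080"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.pick()) // pod-a, pod-b, pod-c, pod-a, ...
	}
}
```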
Thank you for your professional reply! I roughly understand the routing strategy, but I want to know: at the "front door", what strategy is used to send traffic to the different pods? Is the k8s scheduling strategy used?