Add support for weight-based load balancing in Olla
Proposed Feature:
Add a weight option for endpoints.
Weight could be applied to:
- round_robin → classic weighted round-robin (see the sketch after this list)
- least_conn → weighted least-connections
- priority → within each priority tier, distribute traffic according to weight (e.g. two endpoints with the same priority but different capacity)
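To make the round_robin case concrete, here's a minimal sketch of the selection logic, along the lines of nginx's smooth weighted round-robin. This isn't based on Olla's internals; the types, names, and weights are invented for illustration:

```go
package main

import "fmt"

// Endpoint carries a static weight plus the running counter used by
// smooth weighted round-robin (the algorithm nginx uses).
type Endpoint struct {
	Name          string
	Weight        int
	currentWeight int
}

// next bumps every endpoint's current weight by its static weight,
// picks the highest, then docks the winner by the pool's total weight
// so the others catch up over subsequent picks.
func next(pool []*Endpoint) *Endpoint {
	total := 0
	var best *Endpoint
	for _, e := range pool {
		e.currentWeight += e.Weight
		total += e.Weight
		if best == nil || e.currentWeight > best.currentWeight {
			best = e
		}
	}
	if best != nil {
		best.currentWeight -= total
	}
	return best
}

func main() {
	pool := []*Endpoint{
		{Name: "h200", Weight: 4},
		{Name: "h100-1", Weight: 2},
		{Name: "h100-2", Weight: 2},
	}
	for i := 0; i < 8; i++ {
		fmt.Println(next(pool).Name) // h200 lands 4 of every 8 picks
	}
}
```

For least_conn, the analogous change would be to compare active connections divided by weight rather than raw connection counts.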
Thanks for the suggestion! Would love to know more about the use case so we can make the implementation work well.
- What's your current setup where you need weights? (e.g. are you mixing GPU variations - H100 vs GB / RTX 6000 vs RTX 5090 etc.)
- What other differences are you trying to balance - network routing, link speeds?
- How are you currently working around this? (duplicating URLs, multiple instances, etc.?)
- Which load balancer are you using right now as your primary?
Based on your suggestion, I'm wondering if something like this (simplified) would work:
```yaml
endpoints:
  - url: "http://h200-node-1:8000"
    name: "primary-h200"
    type: "vllm"
    priority: 100
    weight: 400
  - url: "http://h100-node-1:8000"
    name: "h100-cluster-1"
    type: "vllm"
    priority: 100
    weight: 250
  - url: "http://h100-node-2:8000"
    name: "h100-cluster-2"
    type: "vllm"
    priority: 100
    weight: 250
  - url: "http://l40s-rack-1:8000"
    name: "inference-l40s-1"
    type: "vllm"
    priority: 100
    weight: 100
```
This is obviously hardware-derived balancing: with a total weight of 1000, the h200 node would take 40% of traffic, the two h100 nodes 25% each, and the l40s node 10%.
Thanks very much for considering this feature. In our setup, our product connects to multiple Ollama servers, and several of them run the same models. Since the servers have different hardware, capacities, and sometimes different network latencies, it would be really helpful to be able to assign a weight to each endpoint so traffic can be balanced according to their capabilities.
We are currently in the development stage. I initially tried duplicating URLs, but Olla only accepts the last duplicate. I then set up a reverse proxy with different domains, which worked: Olla distributed traffic equally between them. I also tried the LiteLLM load balancer, which uses latency-based balancing, but that approach sends all requests to the endpoint with the lowest latency and leaves the others unused. Additionally, LiteLLM doesn't support per-endpoint configuration.
Also, for our use case, having a load-balancer API (or administrative API) to dynamically add or remove endpoints would be extremely valuable; see the sketch below.
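Purely to illustrate what I mean (the routes, payload fields, and port below are invented, not existing Olla APIs), a minimal Go sketch of such an admin API:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// EndpointSpec mirrors the fields of a static config entry.
type EndpointSpec struct {
	URL      string `json:"url"`
	Name     string `json:"name"`
	Type     string `json:"type"`
	Priority int    `json:"priority"`
	Weight   int    `json:"weight"`
}

var (
	mu        sync.Mutex
	endpoints = map[string]EndpointSpec{} // live pool, keyed by name
)

func main() {
	mux := http.NewServeMux()

	// POST /admin/endpoints adds (or replaces) an endpoint at runtime.
	mux.HandleFunc("POST /admin/endpoints", func(w http.ResponseWriter, r *http.Request) {
		var spec EndpointSpec
		if err := json.NewDecoder(r.Body).Decode(&spec); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		mu.Lock()
		endpoints[spec.Name] = spec
		mu.Unlock()
		w.WriteHeader(http.StatusNoContent)
	})

	// DELETE /admin/endpoints/{name} removes one endpoint by name.
	mux.HandleFunc("DELETE /admin/endpoints/{name}", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		delete(endpoints, r.PathValue("name"))
		mu.Unlock()
		w.WriteHeader(http.StatusNoContent)
	})

	log.Fatal(http.ListenAndServe(":9090", mux))
}
```

Even just add, remove, and list operations taking the same fields as the static config would cover our needs.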
Thank you.
I also want to mention that the sample configuration you provided would work well for my setup.