Add support for weight-based load balancing in Olla
Proposed Feature:
Add a weight option for endpoints.
Weight could be applied to:
- round_robin → classic weighted round-robin (see the sketch after this list)
- least_conn → weighted least-connections
- priority → within each priority tier, distribute traffic according to weight (e.g. two endpoints with the same priority but different capacity)
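To make the round_robin case concrete, here's a minimal sketch of the selection logic, along the lines of nginx's smooth weighted round-robin. This isn't based on Olla's internals; the types, names, and weights are invented for illustration:

```go
package main

import "fmt"

// Endpoint carries a static weight plus the running counter used by
// smooth weighted round-robin (the algorithm nginx uses).
type Endpoint struct {
	Name          string
	Weight        int
	currentWeight int
}

// next bumps every endpoint's current weight by its static weight,
// picks the highest, then docks the winner by the pool's total weight
// so the others catch up over subsequent picks.
func next(pool []*Endpoint) *Endpoint {
	total := 0
	var best *Endpoint
	for _, e := range pool {
		e.currentWeight += e.Weight
		total += e.Weight
		if best == nil || e.currentWeight > best.currentWeight {
			best = e
		}
	}
	if best != nil {
		best.currentWeight -= total
	}
	return best
}

func main() {
	pool := []*Endpoint{
		{Name: "h200", Weight: 4},
		{Name: "h100-1", Weight: 2},
		{Name: "h100-2", Weight: 2},
	}
	for i := 0; i < 8; i++ {
		fmt.Println(next(pool).Name) // h200 lands 4 of every 8 picks
	}
}
```

For least_conn, the analogous change would be to compare active connections divided by weight rather than raw connection counts.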
Thanks for the suggestion! Would love to know more about the use case so we can make the implementation work well.
- What's your current setup where you need weights? (e.g. are you mixing GPU variations - H100 vs GB / RTX 6000 vs RTX 5090 etc.)
- What other differences are you trying to balance - network routing, link speeds?
- How are you currently working around this? (duplicating URLs, multiple instances, etc.?)
- Which load balancer are you using right now as your primary?
Based on your suggestion, I'm wondering if something like this (simplified) would work:
```yaml
endpoints:
  - url: "http://h200-node-1:8000"
    name: "primary-h200"
    type: "vllm"
    priority: 100
    weight: 400
  - url: "http://h100-node-1:8000"
    name: "h100-cluster-1"
    type: "vllm"
    priority: 100
    weight: 250
  - url: "http://h100-node-2:8000"
    name: "h100-cluster-2"
    type: "vllm"
    priority: 100
    weight: 250
  - url: "http://l40s-rack-1:8000"
    name: "inference-l40s-1"
    type: "vllm"
    priority: 100
    weight: 100
```
This is obviously hardware-derived balancing: with a total weight of 1000, the h200 node would take 40% of traffic, the two h100 nodes 25% each, and the l40s node 10%.
Thanks very much for considering this feature. In our setup, our product connects to multiple Ollama servers, and several of them run the same models. Since the servers have different hardware, capacities, and sometimes different network latencies, it would be really helpful to be able to assign a weight to each endpoint so traffic can be balanced according to their capabilities.
We are currently in the development stage. I initially tried duplicating URLs, but Olla only accepts the last duplicate. I then set up a reverse proxy with different domains, which worked: Olla distributed traffic equally between them. I also tried the LiteLLM load balancer, which uses latency-based balancing, but that approach sends all requests to the endpoint with the lowest latency and leaves the others unused. Additionally, LiteLLM doesn't support per-endpoint configuration.
Also, for our use case, having a load-balancer API (or administrative API) to dynamically add or remove endpoints would be extremely valuable; see the sketch below.
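Purely to illustrate what I mean (the routes, payload fields, and port below are invented, not existing Olla APIs), a minimal Go sketch of such an admin API:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// EndpointSpec mirrors the fields of a static config entry.
type EndpointSpec struct {
	URL      string `json:"url"`
	Name     string `json:"name"`
	Type     string `json:"type"`
	Priority int    `json:"priority"`
	Weight   int    `json:"weight"`
}

var (
	mu        sync.Mutex
	endpoints = map[string]EndpointSpec{} // live pool, keyed by name
)

func main() {
	mux := http.NewServeMux()

	// POST /admin/endpoints adds (or replaces) an endpoint at runtime.
	mux.HandleFunc("POST /admin/endpoints", func(w http.ResponseWriter, r *http.Request) {
		var spec EndpointSpec
		if err := json.NewDecoder(r.Body).Decode(&spec); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		mu.Lock()
		endpoints[spec.Name] = spec
		mu.Unlock()
		w.WriteHeader(http.StatusNoContent)
	})

	// DELETE /admin/endpoints/{name} removes one endpoint by name.
	mux.HandleFunc("DELETE /admin/endpoints/{name}", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		delete(endpoints, r.PathValue("name"))
		mu.Unlock()
		w.WriteHeader(http.StatusNoContent)
	})

	log.Fatal(http.ListenAndServe(":9090", mux))
}
```

Even just add, remove, and list operations taking the same fields as the static config would cover our needs.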
Thank you.
I also want to mention that the sample configuration you provided would work well for my setup.