
[Feature]: Request prioritization + tiers of endpoints

Open georgeseifada opened this issue 1 year ago • 6 comments

The Feature

  • Request prioritization = Allow completion requests to be tagged, then route each request to a specific endpoint based on the tag. For example, pass a parameter tags=["p1"] to router.acompletion(). All "p1" requests should go to GPT-4 Turbo. "p2" requests should go to GPT-4 Turbo only if less than 50% of its rate limits have been exhausted; otherwise they go to GPT-4. A smarter prioritization algorithm could be used if we can think of one (see the sketch after this list).
  • Tiers of endpoints = For example: (1) Tier 1 has 2 Azure deployments in 2 different regions; we load balance across them evenly until rate limits are hit, then move on to tier 2. (2) Tier 2 has OpenAI accounts; we again load balance across them evenly.
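
A minimal sketch of the proposed call shape (the tags parameter and the tier-aware routing are part of this proposal, not an existing Router argument):

from litellm import Router

# model_list would carry the tiered deployments described above.
router = Router(model_list=model_list)

# Inside an async function: tag the request so the router can pick the tier.
response = await router.acompletion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    tags=["p1"],  # hypothetical parameter from this feature request
)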

Motivation, pitch

  • Request prioritization = Users often have applications that are sensitive to certain factors, for example latency, or requests that come from a very high-paying client. We need to be able to prioritize these accordingly.
  • Tiers of endpoints = Both OpenAI and Azure offer dedicated clusters, which are effectively higher-performance endpoints. There is thus a common production use case where users wish to exhaust the rate limits of the high-performance cluster first before moving on to the others. Also, there may be multiple deployments in Azure across different regions, so we put those all in the same tier and load balance across them evenly.

Twitter / LinkedIn details

No response

georgeseifada avatar Jan 15 '24 22:01 georgeseifada

@georgeseifada thanks for this awesome issue!

  • What's the ideal way you'd like to define priorities for models?
  • Is endpoint tiers basically saying send all requests to tier 1 until they error out and then move to tier 2? Can you show an example of how you'd like to set this?

ishaan-jaff avatar Jan 16 '24 05:01 ishaan-jaff

Hey @ishaan-jaff. Really great library btw.

Below is some example code (adapted from the quickstart) that shows both desired functionalities in one example. I think my mention of GPT-4 Turbo may not have been super clear, but the example shows more precisely what we're looking for. I have some notes after the example as well.

import os

from litellm import Router

# Tiers of endpoints
tier_1 = [{ 
    "model_name": "gpt-4",  # Higher priority model
    "priority_tag": 1,  
    "priority_tpm_threshold": 0.5
    "litellm_params": {
        "model": "azure/super-fast",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }, 
    "tpm": 100000,
    "rpm": 10000,
}, {
    "model_name": "gpt-4", 
    "litellm_params": {
        "model": "azure/regular", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }, 
    "tpm": 100000,
    "rpm": 10000,
}]

tier_2 = [{ 
    "model_name": "gpt-3.5-turbo",  # Lower priority model
    "priority_level": 2,  
    "priority_tpm_threshold": 0.5,
    "litellm_params": {
        "model": "gpt-3.5-turbo", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }, 
    "tpm": 100000,
    "rpm": 10000,
}]

# Ordered list of tiers
model_list = [tier_1, tier_2]

router = Router(model_list=model_list)

# Because this request has priority_level=1, it will first go to "azure/super-fast"
# If that fails or the rate limits are exhausted, it will go to "azure/regular"
# If that fails too or the rate limits are exhausted, it will go to "gpt-3.5-turbo"
response = await router.acompletion(model="gpt-4", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}],
                priority_level=1)  

# Because this request has no priority_level, it will only go to "azure/super-fast" if the current TPM usage is less than 50%
# If the TPM usage is more than 50% on the "azure/super-fast", it will go to "azure/regular"
# If that fails or the rate limits are exhausted, it will go to "gpt-3.5-turbo"
response = await router.acompletion(model="gpt-4", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}]) 

# Because this request has priority_level=2, it will first go to "gpt-3.5-turbo" directly
# If that fails or the rate limits are exhausted, then the whole request fails
response = await router.acompletion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}],
                priority_level=2)  

print(response)

  • The main idea here is that Azure and OpenAI offer some dedicated deployment clusters for their models. These can be more stable than the regular endpoints, so we wish to prioritize certain important requests to them first
  • I can see a further potential use-case / flag. Sometimes, a user is OK with sending a request to a priority_level=1 model, even if their request is not tagged with priority_level=1, as long as there's some TPM / RPM bandwidth. However, there may be another case where only requests marked with priority_level=1 should be sent to the corresponding endpoint(s)
  • Perhaps there is already a way to accomplish these features with the current functionality of the library. If so, can you please share some example code?

Thanks!

georgeseifada avatar Jan 16 '24 15:01 georgeseifada

Is endpoint tiers basically saying send all requests to tier 1 until they error out and then move to tier 2?

To answer this question specifically: I don't necessarily want to wait until tier 1 errors out. Rather, we can keep sending requests to tier 1 as long as:

  1. No errors, which you pointed out
  2. We have not exceeded the RPM or TPM limits. This is important because, for requests that are not marked with priority_level=1, we may only want to give them, say, 50k TPM, even if the endpoint can handle 500k TPM, so that the rest of the bandwidth is left for the higher-priority requests marked with priority_level=1 (see the sketch below).
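
For example, reusing the hypothetical fields from my earlier snippet, a per-deployment budget for non-priority traffic could be expressed like this (priority_level and priority_tpm_threshold are part of the proposal, not existing Router options):

# Hypothetical config: the deployment can handle 500k TPM overall,
# but non-priority traffic is capped at 10% of that (50k TPM),
# leaving the remaining bandwidth for priority_level=1 requests.
tier_1_deployment = {
    "model_name": "gpt-4",
    "priority_level": 1,            # proposed field
    "priority_tpm_threshold": 0.1,  # proposed field: share of TPM open to non-priority requests
    "litellm_params": {
        "model": "azure/super-fast",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
    },
    "tpm": 500000,
    "rpm": 10000,
}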

georgeseifada avatar Jan 16 '24 23:01 georgeseifada

Hi @georgeseifada, this is a really interesting issue. I want to make sure we solve your problem well. Any chance you can hop on a call sometime this week?

Sharing my cal for your convenience https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat

ishaan-jaff avatar Jan 18 '24 15:01 ishaan-jaff

Summarizing call outcomes:

  • requirements:
    • request prioritization: different load balancing groups
    • multi-layered fallback: bug -> now fixed
    • allow controllable cooldown times: added, pending docs https://docs.litellm.ai/docs/routing#cooldowns (@ishaan-jaff); sketch below
    • add setting timeouts per model to router docs @krrishdholakia
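
For reference, a minimal sketch of the controllable cooldowns mentioned above, based on the linked routing docs (treat the exact values as illustrative):

from litellm import Router

router = Router(
    model_list=model_list,   # same shape as the model_list in the earlier snippet
    allowed_fails=3,         # cool a deployment down after 3 failures within a minute
    cooldown_time=30,        # keep the cooled-down deployment out of rotation for 30 seconds
)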

@georgeseifada is this issue now resolved?

krrishdholakia avatar Jan 22 '24 16:01 krrishdholakia

Almost, but not 100%. @ishaan-jaff and I are still investigating the multi-layered fallback bug.

georgeseifada avatar Jan 22 '24 16:01 georgeseifada

@georgeseifada is this now solved?

krrishdholakia avatar Feb 06 '24 22:02 krrishdholakia

@georgeseifada is this now solved?

Yes!

georgeseifada avatar Feb 06 '24 23:02 georgeseifada

Hey @georgeseifada request prioritization is now live - https://docs.litellm.ai/docs/scheduler

is this what you wanted?
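
A minimal sketch of the new priority parameter, following the scheduler docs linked above (assuming lower numbers are scheduled first; check the docs for the exact semantics):

import os

from litellm import Router

router = Router(
    model_list=[{
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "gpt-3.5-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
    }],
    timeout=10,
)

# Inside an async function: lower priority values are scheduled ahead of higher ones.
response = await router.acompletion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
    priority=0,
)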

krrishdholakia avatar Jun 02 '24 02:06 krrishdholakia

Very nice! I will share this with the team 🙏

georgeseifada avatar Jun 03 '24 14:06 georgeseifada

thanks @georgeseifada it's in beta, so any feedback here would help

krrishdholakia avatar Jun 03 '24 14:06 krrishdholakia