[Feature]: Request prioritization + tiers of endpoints
The Feature
- Request prioritization = Allow completion requests to be tagged and then routed to a specific endpoint based on the tag. For example, pass a parameter `tags=["p1"]` to `router.acompletion()`. All `"p1"` requests should go to GPT-4 Turbo. `"p2"` requests should go to GPT-4 Turbo only if less than 50% of the rate limits have been exhausted; otherwise, go to GPT-4. Some other, smarter prioritization algorithm can be used if we can think of one (see the sketch below).
- Tiers of endpoints = For example: (1) Tier 1 has 2 models on Azure across 2 different regions; we load balance across them evenly until rate limits are hit, then we move on to tier 2. (2) Tier 2 has accounts on OpenAI; we again load balance across them evenly.
Motivation, pitch
- Request prioritization = Users often have applications that are sensitive to certain factors. For example, sensitive to latency or a very high-paying client. We need to be able to prioritize these accordingly
- Tiers of endpoints = Both OpenAI and Azure offer dedicated clusters, which are effectively higher-performance endpoints. There is thus a common production use case where users wish to exhaust the rate limits of the high-performance cluster first before moving on to the others. Also, there may be multiple deployments in Azure across different regions, so we put those all in the same tier and load balance across them evenly.
@georgeseifada thanks for this awesome issue!
- What's the ideal way you'd like to assign priorities to models?
- Are endpoint tiers basically saying: send all requests to tier 1 until they error out, and then move to tier 2? Can you show an example of how you'd like to set this?
Hey @ishaan-jaff. Really great library btw.
Below is example code (adapted from the quickstart) that shows both desired functionalities in one place. I think my mention of GPT-4 Turbo may not have been super clear, but the example shows more precisely what we're looking for. I have some notes after the example as well.
```python
import os

from litellm import Router

# Tiers of endpoints
tier_1 = [{
    "model_name": "gpt-4",  # Higher priority model
    "priority_level": 1,
    "priority_tpm_threshold": 0.5,
    "litellm_params": {
        "model": "azure/super-fast",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    },
    "tpm": 100000,
    "rpm": 10000,
}, {
    "model_name": "gpt-4",
    "litellm_params": {
        "model": "azure/regular",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    },
    "tpm": 100000,
    "rpm": 10000,
}]

tier_2 = [{
    "model_name": "gpt-3.5-turbo",  # Lower priority model
    "priority_level": 2,
    "priority_tpm_threshold": 0.5,
    "litellm_params": {
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
    "tpm": 100000,
    "rpm": 10000,
}]

# Ordered list of tiers
model_list = [tier_1, tier_2]

router = Router(model_list=model_list)

# Because this request has priority_level=1, it will first go to "azure/super-fast".
# If that fails or the rate limits are exhausted, it will go to "azure/regular".
# If that fails too or the rate limits are exhausted, it will go to "gpt-3.5-turbo".
response = await router.acompletion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
    priority_level=1)

# Because this request has no priority_level, it will only go to "azure/super-fast" if the current TPM usage is less than 50%.
# If the TPM usage is more than 50% on "azure/super-fast", it will go to "azure/regular".
# If that fails or the rate limits are exhausted, it will go to "gpt-3.5-turbo".
response = await router.acompletion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hey, how's it going?"}])

# Because this request has priority_level=2, it will go to "gpt-3.5-turbo" directly.
# If that fails or the rate limits are exhausted, then the whole request fails.
response = await router.acompletion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
    priority_level=2)

print(response)
```
- The main idea here is that Azure and OpenAI offer some dedicated deployment clusters for their models. These can be more stable than the regular endpoints, so we wish to prioritize certain important requests to them first
- I can see a further potential use-case / flag. Sometimes, a user is OK with sending a request to a `priority_level=1` model, even if their request is not tagged with `priority_level=1`, as long as there's some TPM / RPM bandwidth. However, there may be another case where only requests marked with `priority_level=1` should be sent to the corresponding endpoint(s) (a rough sketch of this flag follows below).
- Perhaps there is already a way to accomplish these features with the current functionality of the library. If so, can you please share some example code?
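To make that flag concrete, something like the following, where `allow_untagged_requests` is a made-up key used only to illustrate the two behaviours:

```python
# Hypothetical flag, illustrative only (not an existing litellm option).
import os

tier_1_strict = [{
    "model_name": "gpt-4",
    "priority_level": 1,
    # If False, only requests explicitly tagged priority_level=1 may use this
    # deployment; untagged requests skip straight to lower tiers.
    "allow_untagged_requests": False,
    "litellm_params": {
        "model": "azure/super-fast",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
    },
    "tpm": 100000,
    "rpm": 10000,
}]
```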
Thanks!
> Are endpoint tiers basically saying: send all requests to tier 1 until they error out, and then move to tier 2?

To answer this question specifically: I don't necessarily want to wait until tier 1 errors out. Rather, we can keep sending requests to tier 1 as long as:
- No errors, which you pointed out
- We have not exceeded the RPM or TPM limits. This is important because for requests that are not marked with `priority_level=1`, we may only want to give them, let's say, 50k TPM, even if the endpoint can handle 500k TPM, because we want to leave the rest of the bandwidth for the higher-priority requests marked with `priority_level=1` (a budget sketch follows below).
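As a concrete illustration of that budget split (the `tpm_budget_by_priority` key is made up; litellm has no such setting today): out of a 500k TPM deployment, 50k TPM would be available to untagged traffic and the remaining 450k TPM reserved for `priority_level=1` requests.

```python
# Hypothetical TPM budgeting, illustrative only (not an existing litellm option).
import os

deployment = {
    "model_name": "gpt-4",
    "litellm_params": {
        "model": "azure/super-fast",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
    },
    "tpm": 500000,  # total capacity of the endpoint
    # Split of the total TPM between tagged and untagged traffic:
    "tpm_budget_by_priority": {
        1: 450000,          # reserved for priority_level=1 requests
        "untagged": 50000,  # everything else
    },
}
```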
Hi @georgeseifada, this is a really interesting issue - I want to make sure we solve your problem well. Any chance you can hop on a call sometime this week?
Sharing my cal for your convenience https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat
Summarizing call outcomes:
- requirements:
- request prioritization: different load balancing groups
- multi-layered fallback: bug -> now fixed
- allow controllable cooldown times: added, pending docs https://docs.litellm.ai/docs/routing#cooldowns (@ishaan-jaff)
- add setting timeouts per model to router docs @krrishdholakia (a config sketch for these settings follows below)
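For reference, a minimal Router configuration sketch touching these settings, based on the linked routing docs; parameter names such as `cooldown_time`, `allowed_fails`, `fallbacks`, and the per-deployment `timeout` should be double-checked against the current docs:

```python
import os

from litellm import Router

model_list = [
    {
        "model_name": "gpt-4",
        "litellm_params": {
            "model": "azure/super-fast",
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
            "timeout": 30,  # per-model timeout (seconds)
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "gpt-3.5-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[{"gpt-4": ["gpt-3.5-turbo"]}],  # fallback across load-balancing groups
    allowed_fails=1,    # failures tolerated before a deployment is cooled down
    cooldown_time=30,   # controllable cooldown time (seconds)
)
```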
@georgeseifada is this issue now resolved?
Almost, but not 100%. @ishaan-jaff and I are still investigating the bug for multi-layered fallbacks.
@georgeseifada is this now solved?
@georgeseifada is this now solved?
Yes!
Hey @georgeseifada request prioritization is now live - https://docs.litellm.ai/docs/scheduler
is this what you wanted?
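A minimal usage sketch based on the linked scheduler docs (the feature is in beta, so verify the exact parameter names there; `priority` is assumed to be lower-is-scheduled-first):

```python
import asyncio
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "api_key": os.getenv("OPENAI_API_KEY"),
            },
        },
    ],
    timeout=30,
)

async def main():
    # Lower `priority` values are scheduled ahead of higher ones when the
    # deployment is saturated.
    response = await router.acompletion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey, how's it going?"}],
        priority=0,
    )
    print(response)

asyncio.run(main())
```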
Very nice! I will share this with the team 🙏
thanks @georgeseifada it's in beta, so any feedback here would help