[Feature]: New Model - Azure PTUs
The Feature
https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput
- support basic completion/embedding
- cost tracking for Azure PTU completion/embedding
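Since PTU capacity is billed by provisioned throughput rather than per token, cost tracking would presumably need custom pricing on the deployment. A minimal sketch, assuming litellm's custom pricing params (input_cost_per_token / output_cost_per_token) can be set in litellm_params; the deployment name and rates below are placeholders, not real PTU prices:

import os

# Hypothetical PTU deployment entry with custom per-token pricing so spend is tracked;
# names and rates are placeholders, not real PTU prices.
ptu_deployment = {
    "model_name": "gpt-4-ptu",
    "litellm_params": {
        "model": "azure/my-ptu-deployment",   # hypothetical Azure deployment name
        "api_base": os.getenv("AZURE_API_BASE"),
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "input_cost_per_token": 0.00001,      # placeholder amortized PTU rate
        "output_cost_per_token": 0.00003,     # placeholder
    },
}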
@ishaan-jaff @krrishdholakia I have a mix of PTU and PAYG deployments for the exact same model variant. It'd be great if I could give a priority in litellm_params so that the router picks the PTU instances before the others, or just a flag in litellm_params indicating whether a model is a PTU or not, with the router prioritizing PTUs first. The tpm/rpm settings in litellm_params won't work here, since PTUs don't have a set tpm/rpm. I think we need an updated simple-shuffle routing strategy where you first pick randomly from all PTUs, and only if all PTUs have failed do you pick randomly from the non-PTUs, as sketched below.
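A minimal sketch of that idea (not litellm's actual router code); the "ptu" flag in litellm_params is hypothetical and would need to be added:

import random

def pick_deployment(healthy_deployments):
    # Prefer deployments flagged as PTU; "ptu" is a hypothetical litellm_params field.
    ptu_deployments = [d for d in healthy_deployments if d["litellm_params"].get("ptu")]
    # Fall back to pay-as-you-go deployments only when no healthy PTU is left.
    candidates = ptu_deployments or healthy_deployments
    return random.choice(candidates)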
@taralika why not give the PTUs an arbitrarily higher rpm/tpm so they're picked more often? For example:
from litellm import Router
import asyncio
import os

model_list = [{ # list of model deployments
    "model_name": "gpt-3.5-turbo", # model alias
    "litellm_params": { # params for litellm completion/embedding call
        "model": "azure/PTU_MODEL", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
        "rpm": 9000, # requests per minute for this API
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": { # params for litellm completion/embedding call
        "model": "azure/regular",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
        "rpm": 10,
    }
}]
# init router
router = Router(model_list=model_list, routing_strategy="simple-shuffle")
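A usage sketch continuing the example above: requests to the "gpt-3.5-turbo" alias are then distributed across the two deployments by the routing strategy.

async def main():
    response = await router.acompletion(
        model="gpt-3.5-turbo",  # model alias from model_list
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response)

asyncio.run(main())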
In simple-shuffle we only use the rpm/tpm to define how often a deployment should be picked.
I'm open to adding a PTU flag but would love to understand why it should exist.
"in simple-shuffle we only use the rpm/tpm to define how often they should be picked"
Can you share a bit more about this logic? I'm open to using rpm and giving it an arbitrarily high number (like the 9000 in your example) for PTU models; however, it'd be great if I don't have to go and set a "low" rpm (like the 10 in your example) on every single non-PTU model.
@taralika we perform a weighted pick: https://github.com/BerriAI/litellm/blob/de3e642999b13ac855e4ce5d77a2af45bd9a5d39/litellm/router.py#L2958
So you would not need to set a "low" rpm on every non-PTU deployment; roughly, it works as sketched below.
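A simplified sketch of an rpm-weighted pick (not the exact code at the link above): each healthy deployment is chosen with probability proportional to its rpm, so a PTU deployment with rpm 9000 is picked roughly 900x as often as a PAYG deployment with rpm 10.

import random

def weighted_pick(healthy_deployments):
    # Weight each deployment by its configured rpm; deployments without an rpm
    # get a default weight of 1 in this sketch.
    weights = [d["litellm_params"].get("rpm", 1) for d in healthy_deployments]
    return random.choices(healthy_deployments, weights=weights, k=1)[0]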
I'd love to set up a support channel and learn how we can improve litellm for you. Would you be free for a call sometime this week? What's the best email to set up a call?
If it's easier, here's a link to my calendar: https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version?month=2024-04
this makes sense, thank you so much for the prompt response! I'll schedule something to connect further.
@ishaan-jaff
https://github.com/BerriAI/litellm/blob/de3e642999b13ac855e4ce5d77a2af45bd9a5d39/litellm/router.py#L2958
Am I reading this code correctly that any model whose rpm is set needs to be at the beginning of the list in the proxy-config.yaml? Otherwise it doesn't use the rpm unless healthy_deployments[0] has rpm listed?
So instead of checking only the first deployment, as the code does today:
rpm = healthy_deployments[0].get("litellm_params").get("rpm", None)
might it not be better to do something like this, checking all deployments so that the model order doesn't matter?
rpm = any(deployment.get("litellm_params", {}).get("rpm") is not None for deployment in healthy_deployments)
closing since we support this