azure-search-openai-demo
Load Balancing Azure OpenAI using Application Gateway
When deploying in a production environment, it's important to be aware of potential rate limits. Azure OpenAI enforces specific quotas: GPT-3.5 models have a default maximum capacity of 240,000 tokens per minute (TPM), while GPT-4 models are limited to 60,000 TPM. To work around these limits, a viable strategy is to deploy multiple Azure OpenAI instances across different regions and place them behind a load balancer, which distributes incoming requests across the instances.
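The distribution idea above can be sketched client-side as a simple round-robin over regional endpoints. This is illustrative only: the endpoint URLs are hypothetical, and a production setup would use Application Gateway (or another load balancer) rather than rotation in application code.

```python
import itertools

# Hypothetical Azure OpenAI endpoints deployed in different regions.
endpoints = [
    "https://my-openai-eastus.openai.azure.com",
    "https://my-openai-westeurope.openai.azure.com",
    "https://my-openai-swedencentral.openai.azure.com",
]

# itertools.cycle yields the endpoints in order, forever.
_cycle = itertools.cycle(endpoints)

def next_endpoint() -> str:
    """Return the next endpoint in round-robin order."""
    return next(_cycle)

# Each request goes to a different region, spreading TPM consumption.
for _ in range(4):
    print(next_endpoint())
```

Because each instance has its own TPM quota, cycling requests across N regions roughly multiplies the available throughput by N, at the cost of managing N deployments.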
This issue is for a: (mark with an `x`)
- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Reference: https://www.raffertyuy.com/raztype/azure-openai-load-balancing/
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.
@vrajroutu I'm the maintainer of LiteLLM. We allow you to do this today using the LiteLLM Router, which load balances between multiple deployments (Azure, OpenAI). I'd love your feedback if this doesn't solve your problem.
Here's how to use it. Docs: https://docs.litellm.ai/docs/routing
```python
import os

from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
    },
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
    },
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ",
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)
print(response)
```