
Load Balancing Azure OpenAI using Application Gateway

Open · vrajroutu opened this issue on Aug 20, 2023 · 4 comments

When deploying in a production environment, it's important to be aware of rate limits. Azure OpenAI enforces specific quotas: GPT-3.5 models have a maximum capacity of 240,000 tokens per minute (TPM), while GPT-4 models are limited to 60,000 TPM. To work around these limits, a viable strategy is to deploy multiple Azure OpenAI instances across different regions and put them behind a load balancer (such as Application Gateway), which distributes incoming requests across the instances effectively.
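
As a lighter-weight alternative to fronting the instances with Application Gateway, the same spreading can be done client-side. Below is a minimal round-robin sketch with fallback on throttling, assuming the v1 openai Python SDK; the endpoint URLs, keys, API version, and deployment name are placeholder assumptions, not values from this repo:

import itertools

import openai
from openai import AzureOpenAI

# One client per regional Azure OpenAI resource (hypothetical endpoints/keys).
clients = [
    AzureOpenAI(
        azure_endpoint="https://my-openai-eastus.openai.azure.com",
        api_key="<eastus-key>",
        api_version="2024-02-01",
    ),
    AzureOpenAI(
        azure_endpoint="https://my-openai-westeurope.openai.azure.com",
        api_key="<westeurope-key>",
        api_version="2024-02-01",
    ),
]
rotation = itertools.cycle(clients)

def chat(messages, deployment="gpt-35-turbo"):
    """Round-robin across regions; fall back to the next region on a 429."""
    for _ in range(len(clients)):
        client = next(rotation)
        try:
            # For Azure, `model` is the deployment name, not the base model name.
            return client.chat.completions.create(model=deployment, messages=messages)
        except openai.RateLimitError:
            continue  # this region is throttled; try the next one
    raise RuntimeError("All regions are currently rate limited")

print(chat([{"role": "user", "content": "Hello"}]).choices[0].message.content)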

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)


vrajroutu · Aug 20, 2023

Reference: https://www.raffertyuy.com/raztype/azure-openai-load-balancing/

vrajroutu · Aug 22, 2023

This issue is stale because it has been open for 60 days with no activity. Remove the stale label or comment, or this issue will be closed.

github-actions[bot] · Oct 22, 2023

@vrajroutu I'm the maintainer of LiteLLM. You can do this today with the LiteLLM Router, which load balances between multiple deployments (Azure, OpenAI). I'd love your feedback if this doesn't solve your problem.

Here's how to use it (docs: https://docs.litellm.ai/docs/routing):

import os

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}])

print(response)
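
Note that all three entries above share the same model_name alias ("gpt-3.5-turbo"): the router treats them as interchangeable deployments for that alias and picks one per call, which is what provides the load balancing. Per the routing docs linked above, the Router can also retry a call against another deployment when one is rate limited.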

ishaan-jaff · Nov 20, 2023

This issue is stale because it has been open for 60 days with no activity. Remove the stale label or comment, or this issue will be closed.

github-actions[bot] · Jan 28, 2024