azure-search-openai-demo
Load Balancing Azure OpenAI using Application Gateway
When deploying in a production environment, it's important to be aware of potential rate limits. Azure OpenAI enforces specific quotas: GPT-3.5 models have a default maximum capacity of 240,000 tokens per minute (TPM), while GPT-4 models are limited to 60,000 TPM. To work around these limits, a viable strategy is to deploy multiple Azure OpenAI instances across different regions and place them behind a load balancer, which distributes incoming requests across the instances.
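The distribution idea above can be sketched client-side as a simple round-robin over regional endpoints. This is illustrative only: the endpoint URLs are hypothetical, and a production setup would use Application Gateway (or another load balancer) rather than rotation in application code.

```python
import itertools

# Hypothetical Azure OpenAI endpoints deployed in different regions.
endpoints = [
    "https://my-openai-eastus.openai.azure.com",
    "https://my-openai-westeurope.openai.azure.com",
    "https://my-openai-swedencentral.openai.azure.com",
]

# itertools.cycle yields the endpoints in order, forever.
_cycle = itertools.cycle(endpoints)

def next_endpoint() -> str:
    """Return the next endpoint in round-robin order."""
    return next(_cycle)

# Each request goes to a different region, spreading TPM consumption.
for _ in range(4):
    print(next_endpoint())
```

Because each instance has its own TPM quota, cycling requests across N regions roughly multiplies the available throughput by N, at the cost of managing N deployments.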
This issue is for a: (mark with an `x`)
- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Reference: https://www.raffertyuy.com/raztype/azure-openai-load-balancing/
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.
@vrajroutu I'm the maintainer of LiteLLM. We allow you to do this today using the LiteLLM Router, which load balances between multiple deployments (Azure, OpenAI). I'd love your feedback if this doesn't solve your problem.
Here's how to use it. Docs: https://docs.litellm.ai/docs/routing
```python
import os

from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
    },
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
    },
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ",
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)
print(response)
```