azure-search-openai-demo

Add OpenAI Priority Load Balancer for Azure OpenAI

Open · simonkurtz-MSFT opened this issue 9 months ago · 9 comments

This PR introduces the openai-priority-loadbalancer as a native Python option for targeting one or more Azure OpenAI endpoints. Among the load balancer's features are:

  • Minimal code and configuration to add abstracted load-balancing to the OpenAI Python API Library via a custom httpx client (see the integration sketch after this list).
  • Priority-based load-balancing for scenarios such as preferring Provisioned Throughput Units (PTU) over Consumption capacity.
  • Respect for Retry-After headers returned by Azure OpenAI, temporarily opening the circuit for the affected endpoint.
  • Random distribution of Azure OpenAI requests across all available backends (non-429 && non-5xx status).
  • Automatic retries of failed requests across the remaining available backends.
  • Return of a 429 status to the OpenAI Python API Library once all backends are exhausted. The Retry-After header value is the lowest (soonest) across all backends, so that the library's next retry is very likely to succeed as soon as possible.
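
For reference, the integration looks roughly like the minimal sketch below. The hostnames, API key, and API version are placeholders, and the exact class names should be verified against the openai-priority-loadbalancer README:

```python
import httpx
from openai import AsyncAzureOpenAI
from openai_priority_loadbalancer import AsyncLoadBalancer, Backend

# Placeholder hostnames. Priority 1 is preferred (e.g. a PTU deployment);
# priority 2 is only used while no priority-1 backend is available.
backends = [
    Backend("oai-ptu-eastus.openai.azure.com", 1),
    Backend("oai-paygo-westus.openai.azure.com", 2),
]

lb = AsyncLoadBalancer(backends)

# The load balancer is injected as the transport of a custom httpx client;
# the SDK still requires an azure_endpoint, but the transport supersedes it
# per request when routing to a selected backend.
client = AsyncAzureOpenAI(
    azure_endpoint=f"https://{backends[0].host}",
    api_key="<placeholder>",
    api_version="2024-02-01",  # placeholder API version
    http_client=httpx.AsyncClient(transport=lb),
)
```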

Relevant links:


This PR can be merged after @pamelafox's approval.

simonkurtz-MSFT · May 17 '24 00:05

Hi @pamelafox & @kristapratico,

This is how the OpenAI Priority Load Balancer integrates. Never mind the hard-coded backend and the location of the backends list in this PR; I don't intend to ask for a merge, but this was the best way to give you an idea of the setup.

If you have two AOAI instances with the same model, you can plug them both in and should see load-balancing.
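
For example (hypothetical hostnames), two instances hosting the same model deployment at equal priority would have requests distributed randomly across both:

```python
from openai_priority_loadbalancer import Backend

# Same priority for both instances: requests are spread randomly across
# the two backends for as long as both remain available.
backends = [
    Backend("oai-instance-1.openai.azure.com", 1),
    Backend("oai-instance-2.openai.azure.com", 1),
]
```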

simonkurtz-MSFT · May 17 '24 00:05

I brought up two AOAI instances with their related assets, configured both as backends in app.py, and then started a conversation.

[screenshots of the chat conversation]

Both backends are responding. It's important to note that this is not a uniform distribution, because available backends are randomized (this is necessary in multi-process workloads, where each process selects backends independently).

[screenshot of the request distribution across both backends]
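
Conceptually, the selection behaves like the following sketch. This is illustrative only, not the library's actual code; `retry_after` here stands in for whatever the library tracks per backend:

```python
import random
from datetime import datetime, timezone

def select_backend(backends):
    """Illustrative selection rule: keep backends whose circuit is closed,
    narrow to the most preferred (lowest) priority among them, then pick
    one at random. Each worker process randomizes independently, so the
    per-process distribution is not uniform."""
    now = datetime.now(timezone.utc)
    available = [b for b in backends if b.retry_after is None or b.retry_after <= now]
    if not available:
        return None  # all circuits open: surface 429 with the soonest Retry-After
    best = min(b.priority for b in available)
    return random.choice([b for b in available if b.priority == best])
```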

At no point did the conversation break down or surface any kind of error through the chat bot.

simonkurtz-MSFT · May 17 '24 15:05