feat: load balancing Google Vertex AI API across US/CA regions

Open msg7086 opened this issue 9 months ago • 1 comments

Summary

The Google Vertex AI API provided by Google Cloud currently has a request quota of 1 request per minute per region. If you are having a conversation with Gemini 1.5 Pro / Flash and you reply more than twice in a minute, you'll hit the quota limit and have to wait. Load balancing across multiple regions solves this problem.

It also spreads load on Google's side, preventing the us-central1 region from being flooded by requests from the same app.

The code change is minimal, so it doesn't impact user experience. The list only includes US/CA regions for now because they are close to the previous option, us-central1. Those who live close to US central should not see any performance impact. Those who don't connect well to US central may see a performance improvement.
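
For illustration only, the idea can be sketched as picking a region at random for each request, so that consecutive requests are spread across per-region quotas. This is not the code from this PR: the region list, `pick_region`, and `vertex_host` are hypothetical names chosen for the example, and the actual set of regions in the change may differ.

```python
import random

# Assumption: an illustrative set of US/CA Google Cloud regions.
# The list used by the actual change may be different.
CANDIDATE_REGIONS = [
    "us-central1",
    "us-east1",
    "us-east4",
    "us-west1",
    "us-west4",
    "northamerica-northeast1",
]


def pick_region(regions=CANDIDATE_REGIONS, rng=random):
    """Return one region chosen uniformly at random for a single request."""
    if not regions:
        raise ValueError("at least one region is required")
    return rng.choice(regions)


def vertex_host(region):
    """Build the regional Vertex AI API host name for the given region."""
    return f"{region}-aiplatform.googleapis.com"


if __name__ == "__main__":
    # Each call may land on a different region, and so on a different quota.
    for _ in range(3):
        print(vertex_host(pick_region()))
```

Random selection keeps the change stateless: no counters or shared state are needed, at the cost of occasionally hitting the same region twice in a row.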

This is a preliminary implementation to mitigate #2723.

Change Type

  • [x] New feature (non-breaking change which adds functionality)

Testing

TBD

Checklist

  • [x] My code adheres to this project's style guidelines
  • [x] I have performed a self-review of my own code
  • [x] I have commented in any complex areas of my code
  • [ ] My changes do not introduce new warnings
  • [ ] Local unit tests pass with my changes

msg7086 • May 19 '24 21:05