LibreChat feat: load balancing Google Vertex AI API across US/CA regions

Open msg7086 opened this issue 9 months ago • 1 comments

Summary

The Google Vertex AI API provided by Google Cloud currently has a request quota of 1 request per minute per region. If you are having a conversation with Gemini 1.5 Pro / Flash and you reply more than twice in a minute, you'll hit the quota limit and have to wait. Load balancing across multiple regions solves this problem, because each region has its own quota.

It also spreads the load on Google's side, preventing the us-central1 region from being flooded by requests from the same app.

The code change is minimal, so it doesn't impact user experience. The list only includes US/CA regions for now because they are close to the previous option, us-central1. Users who live close to US central should not see any performance impact. Users who don't connect well to US central may see a performance improvement.
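
For illustration only, the idea of spreading requests across regions can be sketched as below. This is not the code in this PR (LibreChat itself is a Node.js project). The region list, `pick_region`, and `vertex_endpoint` are hypothetical names chosen for the example, and model availability in each region should be checked against Google's documentation.

```python
import random

# Hypothetical US/CA region list for illustration; the list used by this PR
# is not reproduced here and may differ.
REGIONS = [
    "us-central1",
    "us-east1",
    "us-east4",
    "us-west1",
    "us-west4",
    "northamerica-northeast1",
]


def pick_region(regions=REGIONS, rng=random):
    """Return one region chosen uniformly at random for the next request."""
    return rng.choice(regions)


def vertex_endpoint(region):
    """Build the regional Vertex AI API hostname for the given region."""
    return f"{region}-aiplatform.googleapis.com"


if __name__ == "__main__":
    # Each request picks its own region, so consecutive requests usually
    # count against different per-region quotas.
    for _ in range(3):
        print(vertex_endpoint(pick_region()))
```

Random selection keeps the change stateless: no counter has to be shared between requests or processes. The trade-off is that two consecutive requests can still land on the same region, so this reduces quota hits rather than ruling them out.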

This is a preliminary implementation to mitigate #2723.

Change Type

[x] New feature (non-breaking change which adds functionality)

Testing

TBD

Checklist

[x] My code adheres to this project's style guidelines
[x] I have performed a self-review of my own code
[x] I have commented in any complex areas of my code
[ ] My changes do not introduce new warnings
[ ] Local unit tests pass with my changes

May 19 '24 21:05 msg7086
