LibreChat
feat: load balancing Google Vertex AI API across US/CA regions
## Summary
The Google Vertex AI API provided by Google Cloud currently has a request quota of 1 request per minute per region. If you are having a conversation with Gemini 1.5 Pro / Flash and reply more than twice in a minute, you'll hit the quota limit and have to wait. Load balancing across multiple regions solves this problem.
It also spreads out load on Google's side, preventing the us-central1 region from being flooded with requests from the same app.
The code change is minimal, so it doesn't impact user experience. The list only includes US/CA regions for now because they are geographically close to the previous default, us-central1. Users located near US Central should see no performance impact; users with a poor connection to US Central may see a performance improvement.
This is a preliminary implementation to mitigate #2723.
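The core idea can be sketched as follows: pick a region at random from a fixed US/CA list on each request instead of always targeting us-central1. This is a minimal illustration, not the PR's actual code; the region list, function names, and model name here are assumptions for the example, though the endpoint URL format matches Vertex AI's documented regional endpoint scheme.

```javascript
// Hypothetical sketch of per-request region load balancing.
// All US/CA regions listed here are real GCP regions, but the exact
// set used by the PR may differ.
const VERTEX_REGIONS = [
  'us-central1',
  'us-east1',
  'us-east4',
  'us-west1',
  'us-west4',
  'northamerica-northeast1',
];

// Choose a region uniformly at random for each outgoing request,
// so no single region absorbs all of the app's traffic.
function pickVertexRegion() {
  const i = Math.floor(Math.random() * VERTEX_REGIONS.length);
  return VERTEX_REGIONS[i];
}

// Build the regional Vertex AI endpoint for a generateContent call.
function vertexEndpoint(projectId, model) {
  const region = pickVertexRegion();
  return (
    `https://${region}-aiplatform.googleapis.com/v1` +
    `/projects/${projectId}/locations/${region}` +
    `/publishers/google/models/${model}:generateContent`
  );
}
```

With N regions and a 1-request-per-minute-per-region quota, this raises the effective budget to roughly N requests per minute, at the cost of occasionally landing on a farther region.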
## Change Type
- [x] New feature (non-breaking change which adds functionality)
## Testing
TBD
## Checklist
- [x] My code adheres to this project's style guidelines
- [x] I have performed a self-review of my own code
- [x] I have commented in any complex areas of my code
- [ ] My changes do not introduce new warnings
- [ ] Local unit tests pass with my changes