LibreChat
Enhancement: Load balancing on Gemini API
What features would you like to see added?
As shown here: https://github.com/danny-avila/LibreChat/blob/94eeec354e5a75a5604f61665cddd3d1afbc76f5/api/app/clients/GoogleClient.js#L24
We only use the us-central1 endpoint, which concentrates all query load on the us-central1 servers and also leaves users subject to that region's query limit of 1-2 queries per minute.
It would be great if requests could be load balanced across all regional endpoints, both to spread the load and to work around the per-region query limits.
More details
Due to the quota "Generate content requests per minute per project per base model per minute per region per base_model", the number of requests is limited per minute, per region, and per base model, and the limit is usually 1. This gets used up very quickly if you are having a conversation with Gemini in short sentences.
Many other regions provide the same capabilities.
['us-west1', 'us-west4', 'us-central1', 'us-south1', 'us-east4', 'northamerica-northeast1', 'europe-central2', 'europe-west1', 'europe-west2', 'europe-west3', 'europe-west4', 'europe-west6', 'asia-east1', 'asia-east2', 'asia-south1', 'asia-northeast1', 'asia-northeast3', 'australia-southeast1'] (may not be a complete list)
We can utilize all of them and, if possible, give users the ability to override which regions to use from the .env file.
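A rough sketch of how the .env override could work. The variable name `GOOGLE_REGIONS` and the helper `parseRegions` are made up for illustration and are not existing LibreChat settings; the default list is the one quoted above and may not be complete.

```javascript
// Default regions, copied from the list in this issue (may not be complete).
const DEFAULT_REGIONS = [
  'us-west1', 'us-west4', 'us-central1', 'us-south1', 'us-east4',
  'northamerica-northeast1', 'europe-central2', 'europe-west1',
  'europe-west2', 'europe-west3', 'europe-west4', 'europe-west6',
  'asia-east1', 'asia-east2', 'asia-south1', 'asia-northeast1',
  'asia-northeast3', 'australia-southeast1',
];

// Hypothetical helper: reads a comma-separated region list from the
// environment. GOOGLE_REGIONS is an assumed variable name.
function parseRegions(env = process.env) {
  const raw = env.GOOGLE_REGIONS;
  if (!raw || !raw.trim()) {
    return [...DEFAULT_REGIONS];
  }
  // Trim whitespace, drop empty entries, and remove duplicates.
  const regions = [...new Set(raw.split(',').map((r) => r.trim()).filter(Boolean))];
  // Fall back to the defaults if the override contained nothing usable.
  return regions.length > 0 ? regions : [...DEFAULT_REGIONS];
}
```

With `GOOGLE_REGIONS=us-west1, europe-west4` in .env, only those two regions would be used; leaving it unset falls back to the full list.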
We can pick regions randomly, or use an LRU (least recently used) policy. The goal is to spread query load evenly across all Google regions, which gives a much lower chance of hitting a quota limit and getting an error.
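Something along these lines could implement the two selection strategies. This is only a sketch: `pickRandomRegion`, `createLruPicker`, and `endpointFor` are hypothetical names, and the host pattern `<region>-aiplatform.googleapis.com` is my assumption about how the regional endpoint is built.

```javascript
// Random strategy: every call picks any region with equal probability.
// `random` is injectable so the behaviour can be tested deterministically.
function pickRandomRegion(regions, random = Math.random) {
  if (regions.length === 0) {
    throw new Error('No regions configured');
  }
  return regions[Math.floor(random() * regions.length)];
}

// LRU strategy: always hand out the region that was used longest ago.
function createLruPicker(regions) {
  if (regions.length === 0) {
    throw new Error('No regions configured');
  }
  // Queue ordered from least recently used (front) to most recently used (back).
  const queue = [...regions];
  return function next() {
    const region = queue.shift();
    queue.push(region); // the region just handed out is now the most recent
    return region;
  };
}

// Assumption: regional hosts follow `<region>-aiplatform.googleapis.com`.
function endpointFor(region) {
  return `https://${region}-aiplatform.googleapis.com`;
}
```

Since every picked region is used right away, the LRU picker here behaves like plain round-robin. It guarantees the longest possible gap before a region is reused, while the random picker is stateless but can hit the same region twice in a row.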
Which components are impacted by your request?
Endpoints
Pictures
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct