Enhancement: Load balancing on Gemini API

Open msg7086 opened this issue 9 months ago • 0 comments

What features would you like to see added?

As shown here https://github.com/danny-avila/LibreChat/blob/94eeec354e5a75a5604f61665cddd3d1afbc76f5/api/app/clients/GoogleClient.js#L24

We only use the us-central1 endpoint, which concentrates all query load on the us-central1 servers and also subjects users to the quota of 1-2 queries per minute per region.

It would be great if you could load-balance requests across all regional endpoints, to spread the load better and to work around the per-region query limits.

More details

Because of the quota "Generate content requests per minute per project per base model per minute per region per base_model", the number of requests is limited per minute, per region, and per base model, and the limit is usually 1. It is used up very quickly in a conversation with Gemini that consists of short messages.

Many other regions provide the same capabilities.

['us-west1', 'us-west4', 'us-central1', 'us-south1', 'us-east4', 'northamerica-northeast1', 'europe-central2', 'europe-west1', 'europe-west2', 'europe-west3', 'europe-west4', 'europe-west6', 'asia-east1', 'asia-east2', 'asia-south1', 'asia-northeast1', 'asia-northeast3', 'australia-southeast1'] (may not be a complete list)

We could use all of them and, if possible, let users override which regions to use via the .env file.
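A minimal sketch of how such an override could be read, assuming a comma-separated environment variable. The name `GOOGLE_REGIONS`, the `parseRegions` helper, and the `DEFAULT_REGIONS` constant are hypothetical and are not existing LibreChat settings; the default list is simply the one given above.

```javascript
// Hypothetical default list, copied from the regions named in this issue
// (the issue itself notes it may not be complete).
const DEFAULT_REGIONS = [
  'us-west1', 'us-west4', 'us-central1', 'us-south1', 'us-east4',
  'northamerica-northeast1', 'europe-central2', 'europe-west1',
  'europe-west2', 'europe-west3', 'europe-west4', 'europe-west6',
  'asia-east1', 'asia-east2', 'asia-south1', 'asia-northeast1',
  'asia-northeast3', 'australia-southeast1',
];

/**
 * Parse a comma-separated override such as "us-west1, europe-west4".
 * Whitespace and empty entries are dropped and duplicates are removed.
 * Falls back to a copy of `fallback` when the value is unset or empty.
 */
function parseRegions(raw, fallback = DEFAULT_REGIONS) {
  if (typeof raw !== 'string') {
    return [...fallback];
  }
  const parsed = raw
    .split(',')
    .map((entry) => entry.trim())
    .filter(Boolean);
  const unique = [...new Set(parsed)];
  return unique.length > 0 ? unique : [...fallback];
}

// GOOGLE_REGIONS is an assumed variable name used only for illustration.
const regions = parseRegions(process.env.GOOGLE_REGIONS);
```

With this shape, a line such as `GOOGLE_REGIONS=us-west1,europe-west4` in .env would restrict requests to those two regions, while leaving the variable unset would keep the full default list.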

We could pick regions randomly or use an LRU (least recently used) policy. The goal is to spread query load evenly across all Google regions and greatly reduce the chance of hitting a quota limit and getting an error.
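Both selection strategies can be sketched in a few lines. This is an illustration under assumptions, not the current GoogleClient.js code: `createRandomPicker`, `createLruPicker`, and `endpointFor` are hypothetical helpers, and the hostname pattern is assumed to follow the `<region>-aiplatform.googleapis.com` form of the us-central1 endpoint referenced above.

```javascript
/**
 * Random strategy: each call returns a uniformly chosen region.
 * `random` is injectable so the behavior can be made deterministic in tests.
 */
function createRandomPicker(regions, random = Math.random) {
  if (regions.length === 0) {
    throw new Error('at least one region is required');
  }
  return () => regions[Math.floor(random() * regions.length)];
}

/**
 * LRU strategy: each call returns the least recently used region
 * and marks it as the most recently used one.
 */
function createLruPicker(regions) {
  if (regions.length === 0) {
    throw new Error('at least one region is required');
  }
  const queue = [...regions]; // front of the queue = least recently used
  return () => {
    const region = queue.shift();
    queue.push(region);
    return region;
  };
}

// Assumed hostname pattern for a regional endpoint (hypothetical helper).
function endpointFor(region) {
  return `https://${region}-aiplatform.googleapis.com`;
}

// Usage: choose a region per request, then build the endpoint prefix from it.
const pickRegion = createLruPicker(['us-west1', 'us-central1', 'europe-west4']);
const endpointPrefix = endpointFor(pickRegion());
```

When every request goes through the same picker, the LRU variant behaves like round-robin, so each region is hit once per cycle. The random variant needs no shared state, which may matter if several server processes send requests independently.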

Which components are impacted by your request?

Endpoints

Pictures

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

msg7086, May 14 '24 21:05