Centralizing Rate Limits
I created an OpenAI project set at tier 1 usage and tried to get LiteLLM to work within those lower rate limits. I centralized the Router so it originates from the settings, but I am still running into a lot of issues.
I've built a Router with the following configuration:
```python
{
    "model_list": [
        {
            "model_name": "gpt-4o-2024-08-06",
            "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
            "rpm": 500,
            "tpm": 30000,
        }
    ],
    "num_retries": 3,
    "retry_after": 10,
    "enable_pre_call_checks": True,
}
```
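For reference, that dict is unpacked straight into the Router constructor, roughly like this (the `router_config` name and the `ask` helper below are my illustration, not the exact paper-qa wiring):

```python
from litellm import Router

# Assumes the dict shown above is bound to `router_config`.
router = Router(**router_config)

async def ask(messages):
    # Many of these calls run concurrently (e.g., via asyncio.gather),
    # which is where the 429s described below show up.
    return await router.acompletion(
        model="gpt-4o-2024-08-06", messages=messages
    )
```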
With async concurrency, the issues are:
- The rate limits are not respected and we're getting 429s from OpenAI. LiteLLM also does not honor the 429 response's instruction to wait the requested time (e.g., "retry in 3.5 seconds").
- The logs are full of "Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new" regardless of my log settings
- Even if it did work, there is a fundamental mismatch with our usage. Because of how gpt-4o rate limits work on OpenAI's side, we need to share rate limits across multiple models, but litellm will only share them across deployments (which are keyed to one model or group of models). This is the opposite of what we need (a rough workaround is sketched after this list).
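To make that mismatch concrete, here is a minimal sketch of the kind of limiter we actually need: one process-wide RPM window shared by every snapshot in the gpt-4o family, wrapped around `router.acompletion()`, with a best-effort attempt to honor the server's retry hint on 429. Everything here (`SharedRateLimiter`, the retry parsing, reusing the 500 RPM figure from above) is an illustrative assumption, not an existing LiteLLM or paper-qa API.

```python
import asyncio
import time

import litellm


class SharedRateLimiter:
    """Small sliding-window RPM limiter shared across models (sketch)."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self._timestamps: list[float] = []
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        while True:
            async with self._lock:
                now = time.monotonic()
                # Drop requests that left the 60 s window.
                self._timestamps = [t for t in self._timestamps if now - t < 60]
                if len(self._timestamps) < self.rpm:
                    self._timestamps.append(now)
                    return
                wait = 60 - (now - self._timestamps[0])
            await asyncio.sleep(max(wait, 0.1))


# One limiter for the whole gpt-4o family, regardless of which snapshot we call.
gpt4o_limiter = SharedRateLimiter(rpm=500)


async def limited_acompletion(router, model, messages, max_attempts: int = 5):
    for attempt in range(max_attempts):
        await gpt4o_limiter.acquire()
        try:
            return await router.acompletion(model=model, messages=messages)
        except litellm.RateLimitError as exc:
            # Best-effort: use the server's Retry-After header if we can find
            # one on the underlying response, else fall back to a linear backoff.
            response = getattr(exc, "response", None)
            headers = getattr(response, "headers", None)
            retry_after = headers.get("retry-after") if headers else None
            try:
                delay = float(retry_after)
            except (TypeError, ValueError):
                delay = 2.0 * (attempt + 1)
            await asyncio.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

On the log noise: I believe setting `litellm.suppress_debug_info = True` quiets the "Give Feedback / Get Help" banner, though I haven't confirmed that in this setup.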
I reproduced these errors in the minimal file router_test.py, which triggers 429s even though the rate limits are set to match my OpenAI rate limits.
I'm opening this PR to start a discussion on how to proceed.