paper-qa
Add rate limits for LLMs and Embedding Models
LiteLLM's rate limits weren't suitable for PaperQA because we wanted rate limits that could span models. This PR adds them, with both an in-memory rate limiter and a Redis-based one for rate limiting across processes.
The implementation adds a new decorator, rate_limited, to all 4 inference methods of the LiteLLMModel class. The decorator checks rate limits before inference (with prompt tokens) and after inference (with completion tokens). If token counts aren't known (as with the *_iter methods), it estimates them as the character count divided by the CHARACTERS_PER_TOKEN constant (4).
With low rate limits that don't correspond to a max_token cutoff, it's technically possible for the completion tokens to exceed the maximum allowed in your time window (say your limit is 20 tokens per second and 100 tokens come back). In this case the rate limiter waits it out so that your amortized rate falls back to 20 tokens per second.
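To make the before/after check and the "wait it out" behavior concrete, here is a minimal standard-library sketch. It is not the code in this PR: AmortizedLimiter, estimate_tokens, and rate_limited_sketch are hypothetical names, the balance-draining scheme is just one way to get an amortized rate, and only the CHARACTERS_PER_TOKEN value of 4 comes from the description above.

```python
import functools
import math
import time

CHARACTERS_PER_TOKEN = 4  # value taken from the PR description


def estimate_tokens(text: str) -> int:
    """Approximate a token count from the character count."""
    return math.ceil(len(text) / CHARACTERS_PER_TOKEN)


class AmortizedLimiter:
    """Hypothetical in-memory limiter allowing `amount` tokens per `period` seconds.

    Usage is kept as a balance that drains at the allowed rate, so a request
    larger than one window's budget is paid off by waiting longer.
    """

    def __init__(self, amount, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.capacity = amount
        self.rate = amount / period  # tokens drained per second
        self.used = 0.0
        self.clock = clock  # injectable so the sketch can be tested without real sleeping
        self.sleep = sleep
        self.last = clock()

    def _drain(self) -> None:
        # Forget usage at the allowed rate for the time elapsed since the last update.
        now = self.clock()
        self.used = max(0.0, self.used - (now - self.last) * self.rate)
        self.last = now

    def acquire(self, tokens: int) -> float:
        """Record `tokens` of usage, block until it fits the window, return seconds waited."""
        self._drain()
        self.used += tokens
        wait = max(0.0, (self.used - self.capacity) / self.rate)
        if wait:
            self.sleep(wait)
            self._drain()
        return wait


def rate_limited_sketch(limiter: AmortizedLimiter):
    """Hypothetical decorator: charge prompt tokens before the call, completion tokens after."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(prompt: str) -> str:
            limiter.acquire(estimate_tokens(prompt))  # check before inference
            completion = fn(prompt)
            limiter.acquire(estimate_tokens(completion))  # check after inference
            return completion

        return wrapper

    return decorator
```

With a limit of 20 tokens per second, a 100-token completion arriving on an empty balance makes acquire(100) wait (100 - 20) / 20 = 4 seconds in this sketch, which mirrors the 20-versus-100 scenario described above.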
Configuration is similar for the LiteLLMModel and the LiteLLMEmbeddingModel: you give the config attribute a rate_limit key, like this:
llm = LiteLLMModel(name='gpt-4o-mini', config={
    # 20 tokens per 3 seconds
    "rate_limit": {"gpt-4o-mini": RateLimitItemPerSecond(20, 3)},
})
or
llm = LiteLLMModel(name='gpt-4o-mini', config={
    "model_list": [
        {
            "model_name": "gpt-4o-mini",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "temperature": 0,
            },
        }
    ],
    # 20 tokens per 1 second
    "rate_limit": {"gpt-4o-mini": RateLimitItemPerSecond(20, 1)},
})
and for the embedding model:
embedding = LiteLLMEmbeddingModel(name='text-embedding-3-small', config={"rate_limit": RateLimitItemPerSecond(20, 5)})
I tried this with pqa ask on three papers and got a rate limit error.
I turned on default settings (confusingly, pqa ask defaults to high_quality).
I was using our tier 1 project, with just three medium-sized PDFs in the index.
You can ask via pqa --settings 'tier1_limits' ask 'can pigs fly?' and you should be good now.
Hey @mskarlin, could you point me to how you ended up dealing with TPM-based parallel request rate limiting?