Rate limit reached
Have you guys encountered the following error? Is it caused by the loop? For a long md file it only successfully generated a few Q-A pairs; the others are all marked as errors.
ERROR (Rate limit reached for default-gpt-3.5-turbo in organization org-fJHvwd2zRsGsZjjagX8jhZe5 on tokens per min. Limit: 90000 / min. Current: 86994 / min. Contact us through our help center at help.openai.com if you continue to have issues.): Could not generate text for an input.
I have a safeguard for the number of requests per minute but not for the number of tokens per minute (as that was not needed in my use cases). Reducing the chunk size might help; otherwise a hacky solution would be to reduce one of those values (you could pick it as a function of your average number of tokens per request). This will make the code slower, but given a low enough value you should be fine.
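To make that concrete, a minimal sketch of the "pick a value as a function of your average tokens per request" idea. The 90000 figure comes from the error message above; the average-tokens estimate is hypothetical and you would measure it for your own chunks:

# Sketch only: derive a safe requests-per-minute value from the token budget.
TOKENS_PER_MINUTE_LIMIT = 90000        # from the error message above; check your plan's actual limit
average_tokens_per_request = 1500      # hypothetical estimate for your chunk size
safe_requests_per_minute = TOKENS_PER_MINUTE_LIMIT // average_tokens_per_request  # = 60 here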
Okay, I will try your approach, thank you for your instant response!
Even when I set the parallelism to 5 it was still rate limiting. How T.F. can we overcome that? I'm on a pay-as-you-go plan and loaded up $20 for extraction.
This issue should be reopened IMHO
You will want to change the limits depending on your plan's settings.
If you have more than one key then you can also use them concurrently.
Long term, the best way would be to limit on both requests per minute and tokens per minute. Feel free to submit a PR if you want to tackle it.
question_extractor is architected really well: it is massively parallel. It queues up all the tasks (in my case that can be 360000) and a semaphore then gates the parallelism. Tenacity is used for back-off in case of a rate limit hit; that's one of the three libraries advised / mentioned by the OpenAI docs, and even though the back-off strategy constants are not exactly the same, theoretically all this should be sufficient and should produce the highest possible throughput.
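Roughly this pattern, as a minimal sketch (not the actual question_extractor code; call_model and the prompt list are placeholders for the real OpenAI call and inputs):

import asyncio
from tenacity import retry, stop_after_attempt, wait_random_exponential

semaphore = asyncio.Semaphore(5)  # gate: at most 5 requests in flight at once

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
async def call_model(prompt: str) -> str:
    # placeholder for the real OpenAI call; tenacity retries it with exponential back-off
    ...

async def process_task(prompt: str) -> str:
    async with semaphore:  # only N tasks pass the gate at a time
        return await call_model(prompt)

async def main(prompts):
    # queue up every task, then let the semaphore throttle concurrency
    return await asyncio.gather(*(process_task(p) for p in prompts))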
Maybe in my case I just got scared of the ferocious flood of rate limit messages. I modified the source with an extra boolean parallel function parameter, which unrolls the two main asyncio.gather points.
On top of that I also apply an asyncio.sleep(0.02) after every process_file (in non-parallel mode). The idea behind that is the Pay-as-you-go rate limit of 3500 RPM, which translates to roughly 58.3 requests per second, or about 17 ms per request, so I wait 20 ms. This way my version still hits the rate limit, but only gently: maybe 2-3 rate limit hits per file as it iterates through.
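In sketch form (process_file is the per-file coroutine mentioned above; the surrounding loop and file list are placeholders for whatever my fork actually does):

import asyncio

async def process_all(files):
    for path in files:
        await process_file(path)   # hypothetical per-file coroutine from the fork
        await asyncio.sleep(0.02)  # ~20 ms pause; 3500 RPM allows roughly one request per 17 ms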
I'll push my fork soon and we can look at it.
My fork: https://github.com/CsabaConsulting/question_extractor
I have not looked at your fork, but your approach appears sound if your numbers (like 0.02) can be made explicitly dependent on OpenAI's API limits (as a function of RPM, with maybe a comment pointing to OpenAI's current policy, to avoid magic numbers in the codebase).
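Something along these lines, for example (3500 is just the Pay-as-you-go RPM figure quoted above; the constant should track whatever OpenAI currently documents for your tier):

# See OpenAI's rate limit documentation for the current per-tier limits.
REQUESTS_PER_MINUTE = 3500                           # adjust to your plan
DELAY_BETWEEN_REQUESTS = 60.0 / REQUESTS_PER_MINUTE  # ~0.017 s, so 0.02 is derived, not a magic number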
My fork botched this up a lot. I used a different fork for the other PR because of that. I'd need to think about how to distill the changes nicely.
@MrCsabaToth @nestordemeure @GAOChengzhan I'm the maintainer of LiteLLM - I believe we can help with this problem - I'd love your feedback if LiteLLM is missing something
The LiteLLM router allows you to maximize your throughput by using multiple GPT deployments.
Here's the quick start docs: https://docs.litellm.ai/docs/routing
import asyncio
import os

from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

async def main():
    # openai.ChatCompletion.create replacement
    response = await router.acompletion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey, how's it going?"}])
    print(response)

asyncio.run(main())
Hi @ishaan-jaff, thank you for your message! I took a look at LiteLLM and it sounds like an interesting tool for people wanting to deal with several API keys automatically.
However, using several keys (while possible) is not a focus for Question Extractor: my focus is on not reaching rate limits with a single key (a more common use case to my knowledge).
@nestordemeure if we added the ability to throttle requests / queue them, would you be able to try it and give us feedback?
Good question. I would be open to introducing a light dependency if you provide a PR, but LiteLLM seems too large for my taste here (the documentation mentions things like a database).
The problem should be fixable with a single-file dependency dealing with rates, so I would rather have that in the codebase.
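To make the idea concrete, a rough sketch of what such a single-file limiter could look like: a naive sliding-window gate on both requests per minute and tokens per minute. The class name, limits, and call sites are all hypothetical, not an existing implementation:

import asyncio
import time

class RateLimiter:
    """Naive sliding-window limiter on both requests per minute and tokens per minute."""

    def __init__(self, max_requests_per_minute: int, max_tokens_per_minute: int):
        self.max_rpm = max_requests_per_minute
        self.max_tpm = max_tokens_per_minute
        self.events = []  # (timestamp, tokens) pairs for the last minute
        self.lock = asyncio.Lock()

    async def wait_for_slot(self, tokens: int):
        while True:
            async with self.lock:
                now = time.monotonic()
                # drop events older than one minute
                self.events = [(t, n) for t, n in self.events if now - t < 60]
                if (len(self.events) < self.max_rpm
                        and sum(n for _, n in self.events) + tokens <= self.max_tpm):
                    self.events.append((now, tokens))
                    return
            await asyncio.sleep(1)  # back off briefly before re-checking

A call site would then do something like "await limiter.wait_for_slot(estimated_tokens)" right before each API request.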