Rate limit reached
Have you guys encountered the following error? Is it caused by the loop? For a long md file it only successfully generated a few Q-A pairs; the others are all marked as errors.
ERROR (Rate limit reached for default-gpt-3.5-turbo in organization org-fJHvwd2zRsGsZjjagX8jhZe5 on tokens per min. Limit: 90000 / min. Current: 86994 / min. Contact us through our help center at help.openai.com if you continue to have issues.): Could not generate text for an input.
I have a safeguard for the number of requests per minute but not for the number of tokens per minute (as that was not needed in my use cases). Reducing the chunk size might help; otherwise a hacky solution would be to reduce one of those values (you could pick it as a function of your average number of tokens per request). This will make the code slower, but given a low enough value you should be fine.
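To make that concrete, a minimal sketch of the "pick a value as a function of your average tokens per request" idea. The 90000 figure comes from the error message above; the average-tokens estimate is hypothetical and you would measure it for your own chunks:

# Sketch only: derive a safe requests-per-minute value from the token budget.
TOKENS_PER_MINUTE_LIMIT = 90000        # from the error message above; check your plan's actual limit
average_tokens_per_request = 1500      # hypothetical estimate for your chunk size
safe_requests_per_minute = TOKENS_PER_MINUTE_LIMIT // average_tokens_per_request  # = 60 here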
Okay, I will try your approach, thank you for your instant response!
Even when I set the parallelism to 5 it was still rate limiting. How T.F. can we overcome that? I'm on a pay-as-you-go plan and loaded up $20 for extraction.
This issue should be reopened IMHO
You will want to change the limits depending on your plan's settings.
If you have more than one key then you can also use them concurrently.
Long term, the best way would be to limit on both requests per minute and tokens per minute. Feel free to submit a PR if you want to tackle it.
question_extractor is architected really well: it is massively parallel. It queues up all the tasks (in my case that can be 360000) and a semaphore then gates the parallelism. Tenacity is used for back-off in case of a rate limit hit; that's one of the three libraries advised / mentioned by the OpenAI docs, and even though the back-off strategy constants are not exactly the same, theoretically all this should be sufficient and should produce the highest possible throughput.
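Roughly this pattern, as a minimal sketch (not the actual question_extractor code; call_model and the prompt list are placeholders for the real OpenAI call and inputs):

import asyncio
from tenacity import retry, stop_after_attempt, wait_random_exponential

semaphore = asyncio.Semaphore(5)  # gate: at most 5 requests in flight at once

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
async def call_model(prompt: str) -> str:
    # placeholder for the real OpenAI call; tenacity retries it with exponential back-off
    ...

async def process_task(prompt: str) -> str:
    async with semaphore:  # only N tasks pass the gate at a time
        return await call_model(prompt)

async def main(prompts):
    # queue up every task, then let the semaphore throttle concurrency
    return await asyncio.gather(*(process_task(p) for p in prompts))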
Maybe in my case I just got scared of the ferocious flood of rate limit messages. I modified the source with an extra boolean parallel function parameter, which unrolls the two main asyncio.gather points.
On top of that I also apply an asyncio.sleep(0.02) after every process_file (in non-parallel mode). The idea behind that is the Pay-as-you-go rate limit of 3500 RPM, which translates to roughly 58.3 requests per second, or about 17 ms per request, so I wait 20 ms. This way my version still hits the rate limit, but only gently: maybe 2-3 rate limit hits per file as it iterates through.
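In sketch form (process_file is the per-file coroutine mentioned above; the surrounding loop and file list are placeholders for whatever my fork actually does):

import asyncio

async def process_all(files):
    for path in files:
        await process_file(path)   # hypothetical per-file coroutine from the fork
        await asyncio.sleep(0.02)  # ~20 ms pause; 3500 RPM allows roughly one request per 17 ms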
I'll push my fork soon and we can look at it.
My fork: https://github.com/CsabaConsulting/question_extractor
I have not looked at your fork, but your approach appears sound if your numbers (like 0.02) can be made explicitly dependent on OpenAI's API limits (as a function of RPM, with maybe a comment pointing to OpenAI's current policy, to avoid magic numbers in the codebase).
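Something along these lines, for example (3500 is just the Pay-as-you-go RPM figure quoted above; the constant should track whatever OpenAI currently documents for your tier):

# See OpenAI's rate limit documentation for the current per-tier limits.
REQUESTS_PER_MINUTE = 3500                           # adjust to your plan
DELAY_BETWEEN_REQUESTS = 60.0 / REQUESTS_PER_MINUTE  # ~0.017 s, so 0.02 is derived, not a magic number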
My fork botched this up a lot. I used a different fork for the other PR because of that. I'd need to think about how to distill the changes nicely.
@MrCsabaToth @nestordemeure @GAOChengzhan I'm the maintainer of LiteLLM - I believe we can help with this problem - I'd love your feedback if LiteLLM is missing something
The LiteLLM router allows you to maximize your throughput by using multiple GPT deployments.
Here's the quick start docs: https://docs.litellm.ai/docs/routing
import asyncio
import os

from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

async def main():
    # openai.ChatCompletion.create replacement
    response = await router.acompletion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey, how's it going?"}])
    print(response)

asyncio.run(main())
Hi @ishaan-jaff, thank you for your message! I took a look at LiteLLM and it sounds like an interesting tool for people wanting to deal with several API keys automatically.
However, using several keys (while possible) is not a focus for Question Extractor: my focus is on not reaching rate limits with a single key (a more common use case to my knowledge).
@nestordemeure if we added the ability to throttle requests / queue them, would you be able to try it and give us feedback?
Good question. I would be open to introducing a light dependency if you provide a PR, but LiteLLM seems too large for my taste here (the documentation mentions things like a database).
The problem should be fixable with a single-file dependency dealing with rates, so I would rather have that in the codebase.
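To make the idea concrete, a rough sketch of what such a single-file limiter could look like: a naive sliding-window gate on both requests per minute and tokens per minute. The class name, limits, and call sites are all hypothetical, not an existing implementation:

import asyncio
import time

class RateLimiter:
    """Naive sliding-window limiter on both requests per minute and tokens per minute."""

    def __init__(self, max_requests_per_minute: int, max_tokens_per_minute: int):
        self.max_rpm = max_requests_per_minute
        self.max_tpm = max_tokens_per_minute
        self.events = []  # (timestamp, tokens) pairs for the last minute
        self.lock = asyncio.Lock()

    async def wait_for_slot(self, tokens: int):
        while True:
            async with self.lock:
                now = time.monotonic()
                # drop events older than one minute
                self.events = [(t, n) for t, n in self.events if now - t < 60]
                if (len(self.events) < self.max_rpm
                        and sum(n for _, n in self.events) + tokens <= self.max_tpm):
                    self.events.append((now, tokens))
                    return
            await asyncio.sleep(1)  # back off briefly before re-checking

A call site would then do something like "await limiter.wait_for_slot(estimated_tokens)" right before each API request.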