OpenAI Rate Limits
It would be great to follow OpenAI's rate limits guide so that this LLM wrapper works as expected regardless of chunk size or chunk overlap.
See related issue: https://github.com/hwchase17/langchain/issues/634
we added tenacity to handle retries for the LLM wrapper... did this not suffice for your use case?
I am still hitting rate limit errors for long documents, even if I throttle the embedding calls themselves to 2 RPS. I will have to throttle aggressively to avoid the rate limit errors.
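For context, the kind of throttling I mean is roughly this, as a minimal sketch: small batches of embed_documents calls with a fixed sleep in between. The batch size and interval are arbitrary values of mine, not anything the library provides.

import time

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

def embed_throttled(texts, batch_size=10, interval=0.5):
    """Embed texts in small batches, sleeping between requests (roughly 2 requests per second)."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embeddings.embed_documents(texts[i:i + batch_size]))
        time.sleep(interval)  # crude client-side throttle
    return vectors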
Got some weird rate limit errors here as well...until I turned down the chunk size.
Even though it's well within rate limits, the default chunk size of 1000 didn't work at all for me - only got rate limit errors w/ no completed embeds.
Turned it all the way down to 25 and am now getting 0 errors; still decently fast.
(oh, and for context I was attempting to create embeddings of chat histories)
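In case it helps: I read "chunk size" here as the chunk_size argument on OpenAIEmbeddings, i.e. how many texts are batched into a single embeddings request (the default is 1000); if it instead means the text splitter's chunk size, the same idea of turning it down applies there. A sketch:

from langchain.embeddings import OpenAIEmbeddings

# chunk_size is the number of texts sent per embeddings request; the default of
# 1000 is what produced the rate limit errors above, and 25 keeps each request tiny.
embeddings = OpenAIEmbeddings(chunk_size=25)
vectors = embeddings.embed_documents(["first chat message", "second chat message"])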
I am also running into API rate limitations -- not seeing the backoff with 0.0.121:
raise self.handle_error_response(
openai.error.APIError: Internal error {
    "error": {
        "message": "Internal error",
        "type": "internal_error",
        "param": null,
        "code": "internal_error"
    }
}
500 {'error': {'message': 'Internal error', 'type': 'internal_error', 'param': None, 'code': 'internal_error'}} {'Date': 'Thu, 23 Mar 2023 20:35:41 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '152', 'Connection': 'keep-alive', 'Vary': 'Origin', 'X-Ratelimit-Limit-Requests': '3000', 'X-Ratelimit-Remaining-Requests': '2999', 'X-Ratelimit-Reset-Requests': '20ms', 'X-Request-Id': 'a6fe030b5f5ff42aa520495e0b<redacted>', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains'}
This is while ingesting a large JSON file:
% wc -w file
1003019 file
using:
from langchain.document_loaders import UnstructuredFileLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()
loader = UnstructuredFileLoader("./file")
docs = loader.load()
db = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory=persist_directory)
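For what it's worth, one mitigation is to split the loaded file into smaller documents before calling Chroma.from_documents. A sketch, where the splitter choice and the 1000/100 sizes are my own, and docs, embeddings and persist_directory come from the snippet above:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the single huge document into smaller pieces so each embedded text
# stays well under the per-input token limit.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = splitter.split_documents(docs)
db = Chroma.from_documents(documents=split_docs, embedding=embeddings, persist_directory=persist_directory)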
Still, I suspect the retries simply do not wait long enough (embed_with_retry is clearly wrapped):
def embed_with_retry(embeddings: OpenAIEmbeddings, **kwargs: Any) -> Any:
    """Use tenacity to retry the completion call."""
    retry_decorator = _create_retry_decorator(embeddings)

    @retry_decorator
    def _completion_with_retry(**kwargs: Any) -> Any:
        return embeddings.client.create(**kwargs)

    return _completion_with_retry(**kwargs)

def _create_retry_decorator(embeddings: OpenAIEmbeddings) -> Callable[[Any], Any]:
    import openai

    min_seconds = 4
    max_seconds = 10
    # Wait 2^x * 1 second between each retry starting with
    # 4 seconds, then up to 10 seconds, then 10 seconds afterwards
    return retry(
        reraise=True,
        stop=stop_after_attempt(embeddings.max_retries),
        wait=wait_exponential(multiplier=1, min=min_seconds, max=max_seconds),
        retry=(
            retry_if_exception_type(openai.error.Timeout)
            | retry_if_exception_type(openai.error.APIError)
            | retry_if_exception_type(openai.error.APIConnectionError)
            | retry_if_exception_type(openai.error.RateLimitError)
            | retry_if_exception_type(openai.error.ServiceUnavailableError)
        ),
        before_sleep=before_sleep_log(logger, logging.WARNING),
    )
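In the meantime, one knob that is already exposed is max_retries, which feeds stop_after_attempt above. A minimal sketch of raising it (I believe the default is 6), so tenacity simply keeps retrying for longer before re-raising:

from langchain.embeddings import OpenAIEmbeddings

# More attempts means tenacity keeps retrying (with the 4-10 second exponential
# waits above) for longer before giving up and re-raising the error.
embeddings = OpenAIEmbeddings(max_retries=20)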
Alternatively, could you possibly consider the second suggestion from OpenAI's rate limits guide and use the backoff library for exponential backoff?
import backoff  # for exponential backoff
import openai  # for OpenAI API calls

@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def completions_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)

completions_with_backoff(model="text-davinci-002", prompt="Once upon a time,")
Turning the chunk size down worked for me as well.
The problem with using a smaller chunk size, if your data itself is not "smaller", is that OpenAI will miss the larger context; it depends very much on the average sample size of your data.
If you are grabbing 1-2 sentences and need to look something up within them, 25 tokens will work. If you are grabbing a few paragraphs and need to extract the larger context, 25 tokens will not work well.
Is there a notion of using multiple openai instances and round robin among them?
Good idea! I'm trying to do this, but the final request is still made by openai.api_requestor.py, which makes it significantly more difficult to modify. I tried cutting in via os.environ.get, and I'm still trying.
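For what it's worth, a sketch of the round-robin idea, assuming one OpenAIEmbeddings instance per API key rotated with itertools.cycle. Note that this only works if the per-instance openai_api_key is actually passed through on each request rather than set module-wide, which, as the comment above notes, may not hold in older versions.

import itertools

from langchain.embeddings import OpenAIEmbeddings

API_KEYS = ["sk-...", "sk-..."]  # hypothetical keys from separate OpenAI accounts

# One embeddings client per key; cycle() hands them out in round-robin order.
clients = itertools.cycle([OpenAIEmbeddings(openai_api_key=key) for key in API_KEYS])

def embed_round_robin(batches):
    """Embed each batch of texts with the next client in the rotation."""
    return [next(clients).embed_documents(batch) for batch in batches]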
Hi, @slavakurilyak! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue was about implementing rate limits for OpenAI so that the LLM works properly regardless of chunk size or overlap. It seems the issue has been resolved: users can avoid rate limit errors by turning down the chunk size and using exponential backoff, and using multiple OpenAI instances in a round-robin fashion has also been suggested as a workaround.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!
I have read multiple threads on this issue now, and I can conclude there is a widespread pseudo-solution that deserves critique. Many people suggest lowering the chunk size to avoid getting rate limited. I encourage you all to think about whether this is really how you want to solve problems as a developer. It just so happens that by lowering the chunk size you induce enough delay in your pipeline that the rate limit is not hit. However, this prevents you from using an appropriate chunk size for your specific use case, and if you try to embed larger content, you will probably encounter the same issue again.
Better solution: with the tiktoken library you can count the tokens per chunk, and from there it is possible to insert a wait, either with your own preferred method or with the methods suggested in the OpenAI documentation; see the sketch below.
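A minimal sketch of that idea, assuming the text-embedding-ada-002 encoding and a tokens-per-minute budget you set yourself (the 1,000,000 figure is only a placeholder for your account's actual limit):

import time

import tiktoken
from langchain.embeddings import OpenAIEmbeddings

TPM_BUDGET = 1_000_000  # placeholder: your account's tokens-per-minute limit
enc = tiktoken.encoding_for_model("text-embedding-ada-002")
embeddings = OpenAIEmbeddings()

def embed_within_budget(chunks):
    """Embed chunks, pausing whenever the running token count nears the per-minute budget."""
    used, window_start = 0, time.monotonic()
    vectors = []
    for chunk in chunks:
        tokens = len(enc.encode(chunk))
        if used + tokens > TPM_BUDGET:
            # Wait out the rest of the current one-minute window, then reset the counter.
            time.sleep(max(0.0, 60 - (time.monotonic() - window_start)))
            used, window_start = 0, time.monotonic()
        vectors.append(embeddings.embed_query(chunk))
        used += tokens
    return vectors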
Update: after internalizing hwchase17's comment, I updated the library from 0.0.335 to 0.0.349, and it now seems to take the rate limit into account. I tried to identify which exact version introduced this essential rate-limiting functionality, but I gave up on that particular task.