
OpenAI Rate Limits

Open slavakurilyak opened this issue 2 years ago • 8 comments

It would be great to follow OpenAI's rate limits guide to ensure that this LLM works as expected regardless of chunk size or chunk overlap.

See related issue: https://github.com/hwchase17/langchain/issues/634

slavakurilyak avatar Feb 26 '23 20:02 slavakurilyak

we added tenacity to handle retries for the LLM wrapper... did this not suffice for your use case?

hwchase17 avatar Feb 27 '23 15:02 hwchase17
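As a side note, the number of retry attempts is exposed as max_retries on the OpenAI wrappers (the embeddings retry decorator quoted later in this thread reads embeddings.max_retries), so it can be raised from user code. A minimal sketch for the embeddings wrapper; the value chosen is illustrative, not a recommendation from this thread:

from langchain.embeddings import OpenAIEmbeddings

# max_retries feeds the tenacity stop_after_attempt() condition inside the wrapper,
# so a larger value gives the exponential backoff more chances before re-raising.
embeddings = OpenAIEmbeddings(max_retries=10)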

I am still hitting rate limit errors for long documents, even if I throttle the embedding calls themselves to 2 RPS. I will have to throttle aggressively to avoid the rate limit errors.

damosuzuki avatar Mar 03 '23 15:03 damosuzuki
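For readers wondering what that kind of client-side throttling looks like, here is a minimal sketch; the helper name, batch size, and pacing are illustrative assumptions, and note that embed_documents may itself issue several API requests per call depending on its internal chunk_size:

import time
from langchain.embeddings import OpenAIEmbeddings

def embed_throttled(texts, requests_per_second=2.0, batch_size=100):
    """Embed texts in batches, sleeping between batches to stay near an RPS budget."""
    embeddings = OpenAIEmbeddings()
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        vectors.extend(embeddings.embed_documents(batch))
        time.sleep(1.0 / requests_per_second)  # crude pacing between batches
    return vectors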

Got some weird rate limit errors here as well...until I turned down the chunk size.

Even though it's well within rate limits, the default chunk size of 1000 didn't work at all for me - only got rate limit errors w/ no completed embeds.

Turned it all the way down to 25 and am now getting 0 errors; still decently fast.

(oh, and for context I was attempting to create embeddings of chat histories)

kryptoklob avatar Mar 18 '23 16:03 kryptoklob
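"Chunk size" here can refer either to the text splitter's chunk_size or to the chunk_size on OpenAIEmbeddings (roughly, how many texts are batched into each embedding request). A minimal sketch of turning both down; the value 25 mirrors the comment above, the rest is assumption:

from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smaller text chunks (measured in characters for this splitter).
splitter = RecursiveCharacterTextSplitter(chunk_size=25, chunk_overlap=0)
chunks = splitter.split_text("...a long chat history...")

# Smaller embedding batches: here chunk_size is how many texts go into one API request.
embeddings = OpenAIEmbeddings(chunk_size=25)
vectors = embeddings.embed_documents(chunks)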

I am also running into API rate limitations -- not seeing the backoff with 0.0.121:

 raise self.handle_error_response(
openai.error.APIError: Internal error {
    "error": {
        "message": "Internal error",
        "type": "internal_error",
        "param": null,
        "code": "internal_error"
    }
}
 500 {'error': {'message': 'Internal error', 'type': 'internal_error', 'param': None, 'code': 'internal_error'}} {'Date': 'Thu, 23 Mar 2023 20:35:41 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '152', 'Connection': 'keep-alive', 'Vary': 'Origin', 'X-Ratelimit-Limit-Requests': '3000', 'X-Ratelimit-Remaining-Requests': '2999', 'X-Ratelimit-Reset-Requests': '20ms', 'X-Request-Id': 'a6fe030b5f5ff42aa520495e0b<redacted>', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains'}

This is while ingesting a large JSON file:

% wc -w file
1003019 file

using:

from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("./file")
docs = loader.load()

db = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory=persist_directory)

ventz avatar Mar 23 '23 20:03 ventz
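One way to make a pipeline like the one above friendlier to the rate limiter is to split the single large document before embedding, so no individual request is huge. A sketch only; the splitter choice, chunk parameters, and persist path are assumptions, not values from this report:

from langchain.document_loaders import UnstructuredFileLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

loader = UnstructuredFileLoader("./file")
docs = loader.load()

# Break the single ~1M-word document into smaller Documents before embedding.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = splitter.split_documents(docs)

embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(
    documents=split_docs,
    embedding=embeddings,
    persist_directory="./chroma_db",  # placeholder path
)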

I suspect the retry window is simply not long enough; embed_with_retry is clearly wrapped with the retry decorator:

import logging
from typing import Any, Callable

from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

from langchain.embeddings import OpenAIEmbeddings  # defined in this same module in the library source; imported here so the excerpt stands alone

logger = logging.getLogger(__name__)


def embed_with_retry(embeddings: OpenAIEmbeddings, **kwargs: Any) -> Any:
    """Use tenacity to retry the completion call."""
    retry_decorator = _create_retry_decorator(embeddings)

    @retry_decorator
    def _completion_with_retry(**kwargs: Any) -> Any:
        return embeddings.client.create(**kwargs)

    return _completion_with_retry(**kwargs)


def _create_retry_decorator(embeddings: OpenAIEmbeddings) -> Callable[[Any], Any]:
    import openai

    min_seconds = 4
    max_seconds = 10
    # Wait 2^x * 1 second between each retry starting with
    # 4 seconds, then up to 10 seconds, then 10 seconds afterwards
    return retry(
        reraise=True,
        stop=stop_after_attempt(embeddings.max_retries),
        wait=wait_exponential(multiplier=1, min=min_seconds, max=max_seconds),
        retry=(
            retry_if_exception_type(openai.error.Timeout)
            | retry_if_exception_type(openai.error.APIError)
            | retry_if_exception_type(openai.error.APIConnectionError)
            | retry_if_exception_type(openai.error.RateLimitError)
            | retry_if_exception_type(openai.error.ServiceUnavailableError)
        ),
        before_sleep=before_sleep_log(logger, logging.WARNING),
    )

Could you possibly consider OpenAI's second suggestion instead, using the backoff library for exponential backoff?

import backoff  # for exponential backoff
import openai  # for OpenAI API calls

@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def completions_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)


completions_with_backoff(model="text-davinci-002", prompt="Once upon a time,")

ventz avatar Mar 23 '23 21:03 ventz
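For the embeddings path this thread is mostly about, the same backoff pattern would look roughly like this; a sketch assuming the pre-1.0 openai client, with an illustrative model name:

import backoff  # for exponential backoff
import openai  # for OpenAI API calls

@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def embeddings_with_backoff(**kwargs):
    return openai.Embedding.create(**kwargs)


embeddings_with_backoff(model="text-embedding-ada-002", input=["Once upon a time,"])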

Got some weird rate limit errors here as well...until I turned down the chunk size.

Even though it's well within rate limits, the default chunk size of 1000 didn't work at all for me - only got rate limit errors w/ no completed embeds.

Turned it all the way down to 25 and am now getting 0 errors; still decently fast.

(oh, and for context I was attempting to create embeddings of chat histories)

This worked for me as well

juliencarponcy avatar May 09 '23 13:05 juliencarponcy

The problem with using a smaller chunk size, when your data itself is not small, is that OpenAI will miss the larger context. It very much depends on the average sample size of your data.

If you are grabbing 1-2 sentences and need to look up something within them, 25 tokens will work. If you are grabbing a few paragraphs and need to extract the larger context, 25 tokens will not work well.

ventz avatar May 09 '23 18:05 ventz

Is there a notion of using multiple openai instances and round robin among them?

mayanand avatar May 13 '23 01:05 mayanand
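At the application level (without patching openai's request internals), the round-robin idea could be sketched roughly like this; the key placeholders and helper are hypothetical, and whether it helps depends on the keys belonging to separate rate-limit pools:

import itertools
from langchain.embeddings import OpenAIEmbeddings

# One embeddings client per API key, cycled in round-robin order.
api_keys = ["sk-key-one", "sk-key-two", "sk-key-three"]  # placeholders
clients = itertools.cycle([OpenAIEmbeddings(openai_api_key=key) for key in api_keys])

def embed_round_robin(batches):
    """Send each batch of texts to the next client in the rotation."""
    vectors = []
    for batch in batches:
        vectors.extend(next(clients).embed_documents(batch))
    return vectors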

Is there a notion of using multiple openai instances and round robin among them?

Good idea! I'm trying to do this, but the final request is still made by openai's api_requestor.py, which makes it significantly harder to modify. I tried hooking in via os.environ.get, and I'm still trying.

ugfly1210 avatar Jun 12 '23 08:06 ugfly1210

Hi, @slavakurilyak! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue was about implementing rate limits for OpenAI to ensure the LLM works properly regardless of chunk size or overlap. It seems that the issue has been resolved by implementing rate limits for OpenAI. Users can now avoid rate limit errors by turning down the chunk size and using exponential backoff for rate limit errors. Additionally, using multiple OpenAI instances in a round-robin fashion has been suggested as a solution.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Sep 22 '23 16:09 dosubot[bot]

I have read multiple threads on this issue now, and I can only conclude that there is a widespread pseudo-solution here that deserves critique. Many people suggest lowering the chunk size to avoid getting rate limited. I encourage you all to think deeply about whether this is really how you want to solve problems, in life and as a developer. It just so happens that lowering the chunk size induces enough delay in your pipeline that the rate limit is not hit. However, this prevents you from using an appropriate chunk size for your specific use case, and if you try to embed larger content you will probably run into the same issue again.

Better solution: using the tiktoken library you can count the tokens per chunk; from there it should be possible to induce a wait, either with your own preferred method or with the methods suggested in the OpenAI documentation (a sketch follows after this comment).

Update: After internalizing hwchase17's comment, I updated the library from 0.0.335 to 0.0.349, and it now seems to take the rate limit into account. I tried to identify exactly which version introduced this essential rate-limiting functionality, but I gave up on that particular task.

PilotGFX avatar Dec 12 '23 15:12 PilotGFX
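A minimal sketch of the tiktoken-based pacing suggested in the comment above; the per-minute budget, encoding choice, and embed_fn callable are assumptions, not values from this thread:

import time
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-ada-002
TOKENS_PER_MINUTE = 300_000  # illustrative budget, check your account's actual limit

def embed_with_token_budget(chunks, embed_fn):
    """Count tokens per chunk and sleep out the minute whenever the budget is spent."""
    window_start = time.monotonic()
    tokens_used = 0
    vectors = []
    for chunk in chunks:
        n_tokens = len(ENCODING.encode(chunk))
        if tokens_used + n_tokens > TOKENS_PER_MINUTE:
            elapsed = time.monotonic() - window_start
            if elapsed < 60:
                time.sleep(60 - elapsed)  # wait out the rest of the minute
            window_start = time.monotonic()
            tokens_used = 0
        vectors.append(embed_fn(chunk))
        tokens_used += n_tokens
    return vectors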