Token Limit in Embedding Calculation doesn't correspond to newest OpenAI limits

Open LisaGLH opened this issue 1 year ago • 3 comments

Describe the bug The token limit in the embedding calculation doesn't correspond to the newest (Azure) OpenAI limits. It seems to be hard-coded to 512? If there is already an argument to change this token limit, I'd also be interested in that.

The token limit was introduced in March 2023 (https://github.com/deepset-ai/haystack/pull/4179). A fix using model_max_length was introduced in April 2023 for the HF PromptNode, but not for the OpenAI models (https://github.com/deepset-ai/haystack/pull/4651).

The current token limits are:

  • gpt-35-turbo: 4096 (source)
  • text-embedding-ada-002: 2048 (source)

Error message Calculating embeddings: 12%|█▏ | 10/83 [00:16<02:36, 2.14s/it]06/21/2023 01:33:36 PM The prompt has been truncated from 538 tokens to 512 tokens to fit within the max token limit. Reduce the length of the prompt to prevent it from being cut off

Expected behavior The warning should not be raised and the text should not be truncated if its length (538 tokens) is below the model's actual token limit.

Additional context Reader & PromptNode Setup

  - name: Retriever
    type: EmbeddingRetriever
    params: 
      document_store: DocumentStore
      embedding_model: text-embedding-ada-002
      api_key: KEY
      azure_base_url: URL
      azure_deployment_name: NAME
      top_k: 5
  - name: Reader      
    type: FARMReader    
    params:
      model_name_or_path: deepset/roberta-base-squad2
      context_window_size: 500
      return_no_answer: true
  - name: Prompt
    type: PromptNode
    params:
      model_name_or_path: gpt-35-turbo
      default_prompt_template: Template
      max_length: 2000
      api_key: KEY
      model_kwargs: 
          azure_deployment_name: NAME
          azure_base_url: URL
          temperature: 0.8

To Reproduce Set up a pipeline with a PromptNode and use the Retriever, Reader, and Prompt as above:

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
      - name: Prompt
        inputs: [Reader]

I'm happy to provide our full pipeline if needed. Trying not to clutter the issue here.

FAQ Check

System:

  • OS: Ubuntu
  • GPU/CPU: GPU
  • Haystack version (commit or version number): v1.16.1 from Docker hub
  • DocumentStore: Elasticsearch
  • Reader: FARMReader (deepset/roberta-base-squad2)
  • Retriever: EmbeddingRetriever (text-embedding-ada-002 Azure)
  • Prompt: PromptNode (gpt-35-turbo Azure)

Edit: this might be related to the tokenizer tiktoken, and might "just" need an argument passed to tiktoken, but that's just a guess.

Edit 2: In the original version of this post, I also reported a token limit bug for the PromptNode. I correct myself: the error Exception while running node 'Prompt': The prompt or the messages are too long (692 tokens). The length of the prompt or messages and the answer (4000 tokens) should be within the max token limit (4096 tokens). Reduce the length of the prompt or messages. is accurate, because the prompt and the answer budget added together do exceed the token limit. Excuse the confusion.
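
Regarding the tiktoken guess above, here is a minimal sketch of how the token count can be checked independently of Haystack (illustrative only; text is a placeholder for the document being embedded):

import tiktoken

text = "the document that would be embedded"  # placeholder
encoding = tiktoken.encoding_for_model("text-embedding-ada-002")  # resolves to the cl100k_base encoding
print(len(encoding.encode(text)))  # e.g. 538 for the document in the warning above, which Haystack then cuts to 512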

LisaGLH commented on Jun 21 '23

Following up on this: I would recommend using something like reliableGPT to handle context window limitations. You can wrap the base openai create call with it, and it'll handle retries, model switching, etc.

import openai
from reliablegpt import reliableGPT

openai.ChatCompletion.create = reliableGPT(openai.ChatCompletion.create, user_email=...)

Source: https://github.com/BerriAI/reliableGPT

krrishdholakia commented on Jun 21 '23

Hey @LisaGLH, thanks for such a clearly explained issue. I wanted to confirm whether we check for these limits in the Azure invocation layer, and I see that we do use haystack/utils/openai_utils.py in OpenAIInvocationLayer's __init__, and therefore also in AzureOpenAIInvocationLayer, which extends OpenAIInvocationLayer. We also check for the limits in ChatGPTInvocationLayer and AzureChatGPTInvocationLayer using the same approach. But perhaps I misunderstood your request. Let me know if there is a specific scenario where these limits are not set appropriately.

vblagoje commented on Jul 10 '23

My best guess for this phenomenon is that your max_length is 2000 tokens (the answer size), while gpt-3.5-turbo has a context window of 4096 tokens. That leaves approximately 2k tokens for your prompt, which includes the context documents pulled from the document store. Could you please try lowering your max_length to 256-512 (unless you are expecting elaborate and lengthy answers from gpt-3.5-turbo)? I've added additional unit tests for token limit warnings in OpenAI in https://github.com/deepset-ai/haystack/pull/5351
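
To make the budget concrete, here is an illustrative sketch of the arithmetic and of lowering max_length when building the PromptNode in Python (parameter values are taken from the YAML above; the constructor arguments assume the Haystack 1.x API):

from haystack.nodes import PromptNode

# Token budget for the reported setup
context_window = 4096                 # gpt-3.5-turbo context window
answer_budget = 2000                  # max_length from the PromptNode config above
prompt_budget = context_window - answer_budget  # roughly 2k tokens remain for prompt + documents

prompt_node = PromptNode(
    model_name_or_path="gpt-35-turbo",
    api_key="KEY",                    # placeholder, as in the YAML above
    max_length=256,                   # leave most of the 4096-token window to the prompt
    model_kwargs={
        "azure_deployment_name": "NAME",
        "azure_base_url": "URL",
        "temperature": 0.8,
    },
)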

vblagoje commented on Jul 13 '23

@julian-risch and @ZanSara approved the new unit tests for the OpenAI token limits; they have been merged now. If we don't hear from @LisaGLH by the end of the sprint, we can perhaps close this one.

vblagoje commented on Jul 17 '23

Hi @vblagoje,

you detailed what I also wrote in my edit: I agree that in the PromptNode, when max_length is set, the combined length can exceed the limit and is therefore not allowed.

Edit 2: In the original version of this post, I also reported a token limit bug for the PromptNode. I correct myself: the error Exception while running node 'Prompt': The prompt or the messages are too long (692 tokens). The length of the prompt or messages and the answer (4000 tokens) should be within the max token limit (4096 tokens). Reduce the length of the prompt or messages. is accurate, because the prompt and the answer budget added together do exceed the token limit. Excuse the confusion.

My issue is about the token limit in the embedding calculation (so ada), where we don't have any max_length parameter (that applies to the PromptNode only).

LisaGLH commented on Jul 18 '23

If I understand you correctly, you should set the max_seq_len value of the EmbeddingRetriever. The default is 512, but these OpenAI embedding models support longer sequence lengths. So if you are dealing with longer documents, simply increase the max_seq_len attribute of your EmbeddingRetriever to an appropriate value. For the first generation of embedding models, which are now being deprecated, max_seq_len is 2046, while for the newer ones it is 8191.
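
For example, a minimal sketch of raising the limit when constructing the retriever in Python (assuming the Haystack 1.x EmbeddingRetriever API; in the pipeline YAML above this corresponds to adding a max_seq_len entry under the Retriever params):

from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,        # your existing DocumentStore instance
    embedding_model="text-embedding-ada-002",
    api_key="KEY",                        # placeholder, as in the YAML above
    azure_base_url="URL",
    azure_deployment_name="NAME",
    max_seq_len=2048,                     # raise from the 512 default to what your embedding model supports
    top_k=5,
)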

vblagoje commented on Jul 18 '23

@LisaGLH I believe the recommendation above will fix the issue you are dealing with. If not, please feel free to reopen.

vblagoje commented on Jul 18 '23

I see, nice. Could it be an option not to set it by default, or not to set it automatically for OpenAI? Alternatively, I'd at least advocate for a more precise error message. The current message suggests the limit comes from the model, in my case OpenAI's ada, which it doesn't:

Calculating embeddings:  12%|█▏        | 10/83 [00:16<02:36,  2.14s/it]06/21/2023 01:33:36 PM The prompt has been truncated from 538 tokens to 512 tokens to fit within the max token limit. Reduce the length of the prompt to prevent it from being cut off. 
You can also change `max_seq_len` for the EmbeddingRetriever in accordance with the model you used.

LisaGLH commented on Jul 18 '23

Glad it worked, @LisaGLH. Our current EmbeddingRetriever design aims to hide as many complexities as possible: you can simply change the embedding model name and expect it to function properly. However, this approach has its limitations, such as the need to adjust settings like max_seq_len. In the near future, we plan to make the embedding model and its related settings more explicit for greater flexibility.

vblagoje commented on Jul 18 '23