Token Limit in Embedding Calculation doesn't correspond to newest OpenAI limits
Describe the bug
The token limit in the embedding calculation doesn't correspond to the newest (Azure) OpenAI limits. It seems to be hard-coded to 512. If there is already an argument to change this token limit, I'd also be interested in that.
The token limit was introduced in March 2023: https://github.com/deepset-ai/haystack/pull/4179
There is a fix using model_max_length, introduced in April 2023 for the HF PromptNode, but not for the OpenAI models: https://github.com/deepset-ai/haystack/pull/4651
Current token limits are:
- gpt-35-turbo: 4096 (source)
- text-embedding-ada-002: 2048 (source)
Error message
Calculating embeddings: 12%|█▏ | 10/83 [00:16<02:36, 2.14s/it]06/21/2023 01:33:36 PM The prompt has been truncated from 538 tokens to 512 tokens to fit within the max token limit. Reduce the length of the prompt to prevent it from being cut off
Expected behavior
The prompt should not be truncated (and no warning emitted) if its length (538 tokens) is below the model's actual token limit.
Additional context
Retriever, Reader & PromptNode setup:
```yaml
- name: Retriever
  type: EmbeddingRetriever
  params:
    document_store: DocumentStore
    embedding_model: text-embedding-ada-002
    api_key: KEY
    azure_base_url: URL
    azure_deployment_name: NAME
    top_k: 5
- name: Reader
  type: FARMReader
  params:
    model_name_or_path: deepset/roberta-base-squad2
    context_window_size: 500
    return_no_answer: true
- name: Prompt
  type: PromptNode
  params:
    model_name_or_path: gpt-35-turbo
    default_prompt_template: Template
    max_length: 2000
    api_key: KEY
    model_kwargs:
      azure_deployment_name: NAME
      azure_base_url: URL
      temperature: 0.8
```
To Reproduce
Set up a pipeline with a PromptNode, using the Retriever, Reader and Prompt defined above:
```yaml
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
      - name: Prompt
        inputs: [Reader]
```
I'm happy to provide our full pipeline if needed. Trying not to clutter the issue here.
FAQ Check
- [x] Have you had a look at our new FAQ page?
System:
- OS: Ubuntu
- GPU/CPU: GPU
- Haystack version (commit or version number): v1.16.1 from Docker hub
- DocumentStore: Elasticsearch
- Reader: FARMReader (deepset/roberta-base-squad2)
- Retriever: EmbeddingRetriever (text-embedding-ada-002 Azure)
- Prompt: PromptNode (gpt-35-turbo Azure)
Edit: this might be related to the tiktoken tokenizer, and might "just" need an argument passed to tiktoken, but that's just a guess.
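For illustration, here is a minimal standalone sketch (not the actual Haystack code; the `truncate` helper is hypothetical) of the kind of truncation that produces the warning above, assuming tiktoken's cl100k_base encoding (the one used by text-embedding-ada-002):

```python
# Minimal sketch, not Haystack internals: how a 538-token text ends up
# truncated to 512 tokens before it is sent to the embedding endpoint.
import tiktoken

def truncate(text: str, max_seq_len: int = 512) -> str:
    # cl100k_base is the encoding used by text-embedding-ada-002
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= max_seq_len:
        return text
    # Everything beyond max_seq_len tokens is dropped, hence the warning
    return enc.decode(tokens[:max_seq_len])
```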
Edit 2: In an earlier version of this post, I also reported a token limit bug for the PromptNode. I stand corrected; the exception was: `Exception while running node 'Prompt': The prompt or the messages are too long (692 tokens). The length of the prompt or messages and the answer (4000 tokens) should be within the max token limit (4096 tokens). Reduce the length of the prompt or messages.`
-> This does indeed exceed the token limit when added together (692 + 4000 = 4692 > 4096). Excuse the confusion.
Following up on this, I would recommend using something like reliableGPT to handle context window limitations. You can wrap OpenAI's create call with it, and it'll handle retries, model switching, etc.:
```python
import openai  # required so that openai.ChatCompletion.create can be wrapped
from reliablegpt import reliableGPT

openai.ChatCompletion.create = reliableGPT(openai.ChatCompletion.create, user_email=...)
```
Source: https://github.com/BerriAI/reliableGPT
Hey @LisaGLH, thanks for such a clearly explained issue. I wanted to confirm that we don't check for these limits in the Azure invocation layer, but I see that we do use haystack/utils/openai_utils.py in the OpenAIInvocationLayer init, and thus also in AzureOpenAIInvocationLayer, which extends OpenAIInvocationLayer. We also check for the limits in ChatGPTInvocationLayer and AzureChatGPTInvocationLayer using the same approach. But perhaps I misunderstood your issue. Let me know if there is a specific scenario where these limits are not appropriately set.
My best guess for this phenomenon is that your max_length is 2000 tokens (the answer size), while gpt-3.5-turbo has a context window of 4096 tokens. That leaves approximately 2k tokens for your prompt, which includes the context documents pulled from the document store. Would you please try lowering your max_length to 256-512 (unless you are expecting elaborate and lengthy answers from gpt-3.5-turbo)? I've added additional unit tests for token limit warnings in OpenAI in https://github.com/deepset-ai/haystack/pull/5351
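For reference, a minimal sketch of the suggested change in Python rather than YAML (the KEY/URL/NAME values are placeholders copied from the config above): lowering `max_length` reserves fewer tokens for the answer, so the prompt and retrieved documents fit within gpt-35-turbo's 4096-token context window.

```python
from haystack.nodes import PromptNode

# Mirrors the Prompt node from the YAML config above, with max_length lowered
# from 2000 to 256 so that roughly 3,800 tokens remain for the prompt and the
# retrieved documents (4096 total minus the reserved answer length).
prompt_node = PromptNode(
    model_name_or_path="gpt-35-turbo",
    api_key="KEY",
    max_length=256,  # reserved answer length in tokens
    model_kwargs={
        "azure_deployment_name": "NAME",
        "azure_base_url": "URL",
        "temperature": 0.8,
    },
)
```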
@julian-risch @ZanSara approved new unit tests for the token limits and OpenAI; they have been merged now. Perhaps if we don't hear from @LisaGLH by the end of the sprint we can close this one.
Hi @vblagoje,
you detailed what I also wrote in my edit. I agree that in the PromptNode, when max_length is set, this could potentially violate the limit and is therefore not allowed.
> Edit 2: In an earlier version of this post, I also reported a token limit bug for the PromptNode. I stand corrected; the exception was: `Exception while running node 'Prompt': The prompt or the messages are too long (692 tokens). The length of the prompt or messages and the answer (4000 tokens) should be within the max token limit (4096 tokens). Reduce the length of the prompt or messages.` This does indeed exceed the token limit when added together. Excuse the confusion.
My issue is about the token limit in the embedding (so ada), where there is no max_length parameter (that applies to the PromptNode).
If I understand you correctly, you should set the max_seq_len value of the EmbeddingRetriever. The default is 512, but these OpenAI embedding models support longer sequence lengths. So if you are dealing with longer documents, simply increase your EmbeddingRetriever's max_seq_len attribute to an appropriate value. For the first generation of embedding models, which are now being deprecated, max_seq_len is 2046, while for the newer ones it is 8191.
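As a minimal sketch (the Python equivalent of the YAML retriever above; the document store setup and KEY/URL/NAME values are placeholders), raising `max_seq_len` for an Azure text-embedding-ada-002 retriever might look like this:

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = ElasticsearchDocumentStore(host="localhost")

# Same retriever settings as in the YAML above, plus an explicit max_seq_len
# so documents are no longer truncated at the 512-token default.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="text-embedding-ada-002",
    api_key="KEY",
    azure_base_url="URL",
    azure_deployment_name="NAME",
    max_seq_len=8191,  # text-embedding-ada-002 accepts up to 8191 tokens
    top_k=5,
)
```

In the YAML configuration from the issue, this corresponds to adding `max_seq_len: 8191` under the Retriever's params.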
@LisaGLH I believe the recommendation above will fix the issue you are dealing with. If not, please feel free to reopen.
I see, and nice. Could it be an option not to set it by default, or not to set it automatically for OpenAI? Alternatively, I'd at least advocate for a clearer error message. The current message suggests that the limit comes from the model, in my case OpenAI's ada (which it doesn't):
Calculating embeddings: 12%|█▏ | 10/83 [00:16<02:36, 2.14s/it]06/21/2023 01:33:36 PM The prompt has been truncated from 538 tokens to 512 tokens to fit within the max token limit. Reduce the length of the prompt to prevent it from being cut off.
You can also change `max_seq_len` for the EmbeddingRetriever in accordance with the model you used.
Glad it worked, @LisaGLH. Our current EmbeddingRetriever design aims to hide as many complexities as possible. This means you could just modify the embedding model name and expect it to function properly. However, this approach has its limitations, such as the need to adjust settings like max_seq_len. In the near future, we plan to make the embedding model and its related settings more explicit for greater flexibility.