
RetrievalQAWithSourcesChain causing openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens

Open limcheekin opened this issue 2 years ago • 16 comments

I want to migrate from VectorDBQAWithSourcesChain to RetrievalQAWithSourcesChain. The sample code uses the Qdrant vector store and works fine with VectorDBQAWithSourcesChain.

When I run the code with the RetrievalQAWithSourcesChain changes, it fails with the following error:

openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 4411 tokens (4155 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

The following is the git diff of the code:

diff --git a/ask_question.py b/ask_question.py
index eac37ce..e76e7c5 100644
--- a/ask_question.py
+++ b/ask_question.py
@@ -2,7 +2,7 @@ import argparse
 import os
 
 from langchain import OpenAI
-from langchain.chains import VectorDBQAWithSourcesChain
+from langchain.chains import RetrievalQAWithSourcesChain
 from langchain.vectorstores import Qdrant
 from langchain.embeddings import OpenAIEmbeddings
 from qdrant_client import QdrantClient
@@ -14,8 +14,7 @@ args = parser.parse_args()
 url = os.environ.get("QDRANT_URL")
 api_key = os.environ.get("QDRANT_API_KEY")
 qdrant = Qdrant(QdrantClient(url=url, api_key=api_key), "docs_flutter_dev", embedding_function=OpenAIEmbeddings().embed_query)
-chain = VectorDBQAWithSourcesChain.from_llm(
-        llm=OpenAI(temperature=0, verbose=True), vectorstore=qdrant, verbose=True)
+chain = RetrievalQAWithSourcesChain.from_chain_type(OpenAI(temperature=0), chain_type="stuff", retriever=qdrant.as_retriever())
 result = chain({"question": args.question})
 
 print(f"Answer: {result['answer']}")

If you need the data-ingestion code (creating the embeddings), please check it out here: https://github.com/limcheekin/flutter-gpt/blob/openai-qdrant/create_embeddings.py

Any idea how to fix it?

Thank you.

limcheekin avatar Mar 29 '23 03:03 limcheekin

same issue here

kxykxyou avatar Mar 30 '23 08:03 kxykxyou

I am also looking for a way to limit the amount of data being sent to an external source. Ideally, I would prefer to not make any changes to the existing code.

However, I believe the best solution would be to add a limiter to the classes that interact with OpenAI's API: since we know the limits, we can truncate the text before sending it and print a warning message in the console (see the sketch below).

https://github.com/hwchase17/langchain/issues/2140
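
A minimal sketch of such a limiter, assuming tiktoken is available; truncate_documents is a hypothetical helper, not part of LangChain:

from typing import List

import tiktoken
from langchain.schema import Document

def truncate_documents(docs: List[Document], max_tokens: int, model: str = "text-davinci-003") -> List[Document]:
    """Keep leading documents until the running token count would exceed max_tokens."""
    enc = tiktoken.encoding_for_model(model)
    kept, total = [], 0
    for doc in docs:
        n = len(enc.encode(doc.page_content))
        if total + n > max_tokens:
            print(f"Warning: dropping {len(docs) - len(kept)} document(s) to stay under {max_tokens} tokens")
            break
        kept.append(doc)
        total += n
    return kept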

sergerdn avatar Mar 30 '23 08:03 sergerdn

I ran into the same problem when querying some Chinese documents that I loaded into Chroma:

embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
docsearch = Chroma.from_documents(texts, embeddings)
retriever = docsearch.as_retriever()
qa = RetrievalQA.from_llm(llm=OpenAI(), retriever=retriever)

query = "what"
qa.run(query)

As you can see, my prompt is very short, but it still fails with: InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 6990 tokens (6734 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

qingtian1771 avatar Mar 30 '23 10:03 qingtian1771

You can try setting reduce_k_below_max_tokens=True; it is supposed to limit the number of results returned from the store based on the token limit. But for some reason, it seems to work only with chat models (GPT-3.5/GPT-4), at least from my testing.
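
For example, something like this, adapted from the snippet in the original question (a sketch; reduce_k_below_max_tokens and max_tokens_limit are fields on RetrievalQAWithSourcesChain, and from_chain_type passes extra keyword arguments through to the chain):

from langchain import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain

chain = RetrievalQAWithSourcesChain.from_chain_type(
    OpenAI(temperature=0),
    chain_type="stuff",
    retriever=qdrant.as_retriever(),
    reduce_k_below_max_tokens=True,  # drop retrieved docs until the prompt fits
    max_tokens_limit=3375,           # example value; leave headroom for the completion
)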

harithzulfaizal avatar Mar 31 '23 08:03 harithzulfaizal

You can try setting reduce_k_below_max_tokens=True; it is supposed to limit the number of results returned from the store based on the token limit. But for some reason, it seems to work only with chat models (GPT-3.5/GPT-4), at least from my testing.

That works! Please review the latest code at https://github.com/limcheekin/flutter-gpt/blob/openai/ask_question.py; I'd appreciate it if you spot any improvements.

Thanks.

limcheekin avatar Apr 03 '23 04:04 limcheekin

You can try setting reduce_k_below_max_tokens=True; it is supposed to limit the number of results returned from the store based on the token limit. But for some reason, it seems to work only with chat models (GPT-3.5/GPT-4), at least from my testing.

That's good to know, but I don't know how to set reduce_k_below_max_tokens=True. Can you give me some examples?

wen020 avatar Apr 03 '23 10:04 wen020

Please see my code example at https://github.com/limcheekin/flutter-gpt/blob/openai/ask_question.py#L42

Let me know if it works for you.

limcheekin avatar Apr 03 '23 11:04 limcheekin

Thanks, @limcheekin. I saw your code, but you use RetrievalQAWithSourcesChain, which has a reduce_k_below_max_tokens field, as this reference says: https://python.langchain.com/en/latest/reference/modules/chains.html#langchain.chains.RetrievalQAWithSourcesChain

But I use RetrievalQA, which does not have a reduce_k_below_max_tokens field; please check this reference: https://python.langchain.com/en/latest/reference/modules/chains.html#langchain.chains.RetrievalQA

Should I change to RetrievalQAWithSourcesChain? What's the difference between them?

Thanks!

qingtian1771 avatar Apr 04 '23 03:04 qingtian1771

There is a more fundamental fix for this issue: retriever=VectorStoreRetriever(vectorstore=vectorstore, search_kwargs={"filter": {"type": "filter"}, "k": 3})

I guess it will work for qdrant as well
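
Applied to the Qdrant retriever from the original question, that would look roughly like this (k=3 is just an example value capping how many documents get stuffed into the prompt):

retriever = qdrant.as_retriever(search_kwargs={"k": 3})
chain = RetrievalQAWithSourcesChain.from_chain_type(
    OpenAI(temperature=0), chain_type="stuff", retriever=retriever
)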

Jeru2023 avatar Apr 05 '23 14:04 Jeru2023

Thanks, @limcheekin. I saw your code, but you use RetrievalQAWithSourcesChain, which has a reduce_k_below_max_tokens field, as this reference says: https://python.langchain.com/en/latest/reference/modules/chains.html#langchain.chains.RetrievalQAWithSourcesChain

But I use RetrievalQA, which does not have a reduce_k_below_max_tokens field; please check this reference: https://python.langchain.com/en/latest/reference/modules/chains.html#langchain.chains.RetrievalQA

Should I change to RetrievalQAWithSourcesChain? What's the difference between them?

Thanks!

I also have this problem. I am using openai.ChatCompletion.

frankg1 avatar Apr 11 '23 09:04 frankg1

@wen020

There seems to be a max_tokens_limit parameter, but it is not reflected in the documentation.

This fixed my issue:

ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), vectorstore.as_retriever(search_kwargs={"k": 1}), max_tokens_limit=4097)
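
Spelled out a little more, a sketch assuming an existing vectorstore (the max_tokens_limit value should leave room for the completion):

from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(
    OpenAI(temperature=0),
    vectorstore.as_retriever(search_kwargs={"k": 1}),  # retrieve a single document
    max_tokens_limit=4097,  # trim retrieved docs so the prompt stays within the context window
)
result = qa({"question": "your question here", "chat_history": []})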

ZiyadMoraished avatar Apr 12 '23 09:04 ZiyadMoraished

same issue here

qixiaobo avatar Apr 13 '23 08:04 qixiaobo

This does not work with RetrievalQA; it only works with RetrievalQAWithSourcesChain.

GitNinja42 avatar Apr 17 '23 18:04 GitNinja42

@ZiyadMoraished That solution works only for one type of chain, ConversationalRetrievalChain; it does not work for the others.

budhewarvijay0407 avatar Apr 20 '23 07:04 budhewarvijay0407

First of all, on data preparation: how you split the text into chunks is important. Please look at the following video and notebook to understand more: https://www.youtube.com/watch?v=eqOfr4AGLk8 https://github.com/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb
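
To illustrate the chunking point, a sketch with RecursiveCharacterTextSplitter (the sizes are placeholder values, not recommendations):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smaller chunks mean fewer tokens per retrieved document, so the "stuff"
# prompt is less likely to exceed the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = splitter.split_documents(documents)  # `documents` comes from your loader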

Also see the recent post on the official LangChain blog: https://blog.langchain.dev/improving-document-retrieval-with-contextual-compression/

These two approaches should solve the issue. I will close this issue if it stays quiet for a week.

Thanks.

limcheekin avatar Apr 24 '23 02:04 limcheekin

Hi, trying to implement the suggested solutions myself right now; where would one add these parameters? Thanks!

kruulik avatar May 02 '23 01:05 kruulik

Still have that same issue with RetrievalQA.from_chain_type

bluusun avatar May 12 '23 01:05 bluusun

Please read the issue from start to finish; there are a few solutions mentioned. Try them one by one and one of them should fix the issue. Otherwise, please open a new issue for the problem you have and share the GitHub repo so that it is easy for others to reproduce the problem and help you.

The issue is dated. I think most people here found a solution or workaround, so I will close the issue for now.

limcheekin avatar May 12 '23 03:05 limcheekin

I am not sure :)

I switched to this, which seemed to work:

chain = VectorDBQAWithSourcesChain.from_llm(
    llm=OpenAI(temperature=0, verbose=True), vectorstore=store, verbose=True)
result = chain({"question": args.question})

Which got me:

UserWarning: VectorDBQAWithSourcesChain is deprecated - please use: from langchain.chains import RetrievalQAWithSourcesChain

And I modified it to this:

qa = RetrievalQAWithSourcesChain.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True
)

And I am back at the same error ...

openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 5541 tokens (5285 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
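
For what it's worth, the earlier suggestions in this thread can be combined in that call; a sketch, assuming store is the vector store from the earlier snippet:

qa = RetrievalQAWithSourcesChain.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=store.as_retriever(search_kwargs={"k": 2}),  # fewer documents per query
    reduce_k_below_max_tokens=True,                        # drop docs until the prompt fits
    return_source_documents=True,
)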

bluusun avatar May 12 '23 17:05 bluusun

You can try setting reduce_k_below_max_tokens=True; it is supposed to limit the number of results returned from the store based on the token limit. But for some reason, it seems to work only with chat models (GPT-3.5/GPT-4), at least from my testing.

That works! Please review the latest code at https://github.com/limcheekin/flutter-gpt/blob/openai/ask_question.py; I'd appreciate it if you spot any improvements.

Thanks.

I'm new to this, where should I add this code?

Thanks

gisligeir avatar May 15 '23 13:05 gisligeir

same question

hyp530 avatar May 21 '23 11:05 hyp530

The following notebook should be helpful: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

If none of the solutions above fixed your problem, the issue is most likely caused by the data, so make sure you understand your data-preparation process.

limcheekin avatar May 22 '23 08:05 limcheekin

If you are using an agent without chains but have a vector store (Pinecone in my case) as a retrieval tool, where would you limit either max_tokens or top_k?

tevslin avatar May 26 '23 15:05 tevslin

The following answer comes from above in this thread:

vectorstore.as_retriever(search_kwargs={"k": 1})
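
In the agent setting, the limit goes on the retriever behind the tool. A rough sketch wrapping a RetrievalQA chain as a Tool (names are illustrative; vectorstore would be your Pinecone store):

from langchain.agents import Tool
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # cap the number of retrieved docs
qa = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0), chain_type="stuff", retriever=retriever)
docs_tool = Tool(
    name="docs-search",
    func=qa.run,
    description="Answers questions about the indexed documents.",
)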

limcheekin avatar May 26 '23 23:05 limcheekin

I ran into the same problem when querying some Chinese documents that I loaded into Chroma:

embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
docsearch = Chroma.from_documents(texts, embeddings)
retriever = docsearch.as_retriever()
qa = RetrievalQA.from_llm(llm=OpenAI(), retriever=retriever)

query = "what"
qa.run(query)

As you can see, my prompt is very short, but it still fails with: InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 6990 tokens (6734 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

Same error. Any updates here?

xumeng avatar Jun 30 '23 06:06 xumeng

Same error for me. Since it is OpenAI's limit: if the output is too long, we can set max_tokens to limit the output length; but if the input context is too long, I don't know of a parameter to control it. What I can think of is reducing the input context or using compression (I haven't tried it yet), so I will try to reduce my input context size.
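
The compression route mentioned above would look roughly like this with LangChain's ContextualCompressionRetriever (untested sketch; note that it adds extra LLM calls per query):

from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The compressor extracts only the parts of each retrieved document relevant
# to the query, shrinking the context before it reaches the QA prompt.
compressor = LLMChainExtractor.from_llm(OpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=docsearch.as_retriever(),
)
qa = RetrievalQA.from_chain_type(llm=OpenAI(), retriever=compression_retriever)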

saaspeter avatar Aug 02 '23 13:08 saaspeter

#2140

I ran into the same problem when querying some Chinese documents that I loaded into Chroma:

embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
docsearch = Chroma.from_documents(texts, embeddings)
retriever = docsearch.as_retriever()
qa = RetrievalQA.from_llm(llm=OpenAI(), retriever=retriever)

query = "what"
qa.run(query)

As you can see, my prompt is very short, but it still fails with: InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 6990 tokens (6734 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

How did you solve it??

GeneralLHW avatar Aug 08 '23 08:08 GeneralLHW

@GeneralLHW, I think the error is saying your context is too long. The context comes from Chroma, so I think the documents in Chroma are too long. You can set verbose=True to see the complete question and response sent to OpenAI.

saaspeter avatar Aug 08 '23 09:08 saaspeter

@saaspeter ,

llm = ChatOpenAI(max_tokens=4096, model_name='gpt-3.5-turbo', temperature=0, verbose=True)
qa = RetrievalQA.from_chain_type(llm, chain_type="map_rerank", retriever=docsearch.as_retriever(), verbose=True)
result = qa.run('1+1')

I added the parameters, but I still can't see the complete question and reply. What should I do?

GeneralLHW avatar Aug 09 '23 02:08 GeneralLHW

@GeneralLHW, try adding langchain.debug = True in your Python file. For me, that shows the full call parameters and responses.
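
For example, a minimal sketch continuing the snippet above:

import langchain

langchain.debug = True  # print every chain and LLM call with full prompts and responses

result = qa.run('1+1')
print(result)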

saaspeter avatar Aug 09 '23 14:08 saaspeter