model's maximum context length
I have very often been running into openai.error.InvalidRequestError for exceeding the 4097-token maximum context length. Is there a module or best practice to manage the context length?
Quick example: I am adding the map_reduce summarize chain to the URL data loader and it's throwing that error:
import os

from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import UnstructuredURLLoader

os.environ["OPENAI_API_KEY"] = "*"

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

# Load the raw page contents (each URL becomes one large Document).
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

# Summarize with a map_reduce chain; this is where the token limit is hit.
llm = OpenAI(temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(data))
Error: openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 8356 tokens (8100 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
How did you manage this issue?
My case was a bit simpler: I was providing context plus questions and retrieving the answers, appending the context to the message. In your case, check that the data you are passing is not repeating; if it isn't, then pass the data in chunks.
On second thought, I may be confusing my issue with yours; your case might be different from the one I faced.
I have the same issue. I tried setting the chunk size for the splitter and it works:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=3000)
docs = WebBaseLoader(url).load_and_split(text_splitter)
I get the same error. Any workaround?
I get the same error.
I have the same issue when using PyPDFLoader with load_and_split method.
Same, with CharacterTextSplitter. Without having looked at the source, my hunch is that the chunking only uses the chunk size as a reference but actually chunks on the nearest line break or other separator character, so there does not seem to be any guarantee that a chunk fits in the context length.
If you have this issue with e.g. the CharacterTextSplitter, a workaround is to use the RecursiveCharacterTextSplitter and set a couple of separators that work for you. This reduces the chance of having a chunk that doesn't fit, e.g.:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
)
In general, reduce chunk size and set a separator that appears frequently.
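To make that concrete, here is a minimal sketch (not from the thread) that applies the splitter to the data and chain from the original example; the cl100k_base encoding is only an assumption used for a rough token count:

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Rough token counter, just for a sanity check (encoding choice is an assumption).
enc = tiktoken.get_encoding("cl100k_base")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
)
docs = text_splitter.split_documents(data)  # `data` from the UnstructuredURLLoader example above

# Verify that every chunk is well below the 4097-token context window.
print(max(len(enc.encode(d.page_content)) for d in docs))

print(chain.run(docs))  # `chain` is the map_reduce summarize chain from the question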
I have the same error with CharacterTextSplitter
I have the same issue with [Recursive]CharacterTextSplitter. Specifying custom separators didn't help.
Try using:
chain = RetrievalQAWithSourcesChain.from_chain_type(llm, chain_type="stuff", retriever=db.as_retriever(), reduce_k_below_max_tokens=True,)
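For context, db in that snippet is a vector store retriever. A minimal sketch of one possible setup (the FAISS/OpenAIEmbeddings choice and the reuse of data and llm from the original example are assumptions, not part of the suggestion above):

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Split the loaded pages and index them so only the most relevant chunks are retrieved.
docs = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(data)
db = FAISS.from_documents(docs, OpenAIEmbeddings())

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    reduce_k_below_max_tokens=True,  # drop the lowest-ranked docs until the prompt fits
)
print(chain({"question": "What happened on February 8?"}))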
@nilsec's answer worked for me. I have now switched to RecursiveCharacterTextSplitter:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=4000, chunk_overlap=0, separators=[" ", ",", "\n"]
)
The problem is not with splitting the initial data into chunks of the right size.
The problem is the following: after creating summaries for each chunk, the chain collects them into one prompt, and that combined prompt can itself exceed the limit. So bigger chunks can help in some cases, but if there are many split documents, at some point it will still hit the token limit.
The issue is still present. I think the correct solution is that load_summarize_chain should check the final combine request against a limit, based on a parameter like max_tokens or on the chosen model, and split it again into a couple of requests if needed.
Perhaps the reduce step should use a recursive algorithm based on the limit size.
Example:
- the last query, which generates the final summary from the sub-summaries, exceeds the token limit
- it should be split into two requests and a summary produced for each of them (or split again if needed)
- then try again to generate the final summary
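A rough sketch of that recursive reduce idea (an illustration only, not langchain's implementation; summarize_batch is a hypothetical function that asks the LLM to summarize a piece of text):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TOKEN_LIMIT = 3000  # stay safely below the 4097-token context window


def reduce_summaries(summaries, summarize_batch):
    combined = "\n".join(summaries)
    if len(enc.encode(combined)) <= TOKEN_LIMIT:
        # Everything fits: one final call produces the overall summary.
        return summarize_batch(combined)
    if len(summaries) == 1:
        # A single over-long summary: split the text itself in half.
        text = summaries[0]
        summaries = [text[: len(text) // 2], text[len(text) // 2 :]]
    # Otherwise reduce each half separately, then combine the two results.
    mid = len(summaries) // 2
    left = reduce_summaries(summaries[:mid], summarize_batch)
    right = reduce_summaries(summaries[mid:], summarize_batch)
    return reduce_summaries([left, right], summarize_batch)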
The map_reduce chain should in fact apply recursively, not only once and then hope all the summaries will magically fit into one prompt. It's likely the concatenated summaries won't, and in that case the chain should apply again.
Indeed, map_reduce should be applied recursively so the combined summaries fit within the 4096-token limit.
The process should at least give a better trace of the error since it is hard to understand what is happening at first sight.
It could be a bit dangerous to have a recursive process here in my view, even if you set a max_depth. A different feature could be, for example, to allow you to process the intermediate summaries (LLMs are very verbose, so you can easily clean them up and save tokens) before passing them to the combined summary.
The recursiveness and cost impacts are of course relevant. It could be optional, but I still think it would benefit this use case.
The alternative today is an error. It simply doesn't work for big documents (or I missed something).
I think the use case here is not to process very large documents (for that, you have to consider other approaches such as using a vector DB and semantic search) but to mildly extend the token limits from where they are today. Even so, I agree that the error is not verbose, and you should get a bit more control over what the token extension could look like. Are we talking about ingesting a doc of ~40k tokens into an API with a limit of ~4k, so a 10x increase? Or just 3x or 4x? I understand the complexity, since the models are not deterministic and can output very different things, especially considering that you can change the prompt. But that is why I shared the possibility of processing the intermediate summaries to save tokens, since the outputs of the models are very verbose.
+1 for map_reduce supporting a recursive algorithm
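Illustrative only: one possible shape of that intermediate-summary processing, as a hypothetical condense() hook that trims verbose lead-ins and caps each map output at a token budget before the combine step (the phrase list and budget are arbitrary assumptions):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
VERBOSE_PREFIXES = ("In summary,", "Overall,", "This document discusses")


def condense(summary: str, max_tokens: int = 300) -> str:
    text = summary.strip()
    for prefix in VERBOSE_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):].lstrip()
    # Hard cap on the token count of each intermediate summary.
    return enc.decode(enc.encode(text)[:max_tokens])


intermediate_summaries = ["...", "..."]  # outputs of the map step
combined_input = "\n".join(condense(s) for s in intermediate_summaries)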
Any updates on that topic?
Another case this comes up is if you want to use load_summarize_chain, but with the more affordable text-curie-001 model (one of curie's use cases is faster/cheaper summarization). This constructs the final prompt correctly:
summary_chain = load_summarize_chain(llm=OpenAI(model="text-curie-001", temperature=0), chain_type="map_reduce", verbose=True)
poss_summary = summary_chain.run(pages).strip(" \n\t")
But it errors on use because:
This model's maximum context length is 2049 tokens, however you requested 3230 tokens (2974 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
I thought I might be able to pass token_max=1700 in as a kwarg, but I get
1 validation error for MapReduceDocumentsChain
kwargs
extra fields not permitted (type=value_error.extra)
For those who are still struggling with this, here is some code I wrote to get around it for now. Note that this doesn't use Documents, though it could easily be converted to do so. Hope this helps people while the langchain contributors work on this issue!
import math

import tiktoken
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
ChatPromptTemplate,
SystemMessagePromptTemplate,
HumanMessagePromptTemplate,
)
ENCODING = tiktoken.get_encoding("cl100k_base")
SUMMARIZE_MODEL = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0.2)
MAX_TOKENS_SUMMARY = 3000
SUMMARY_SYS_MSG = """You are SummaryGPT, a model designed to ingest content and summarize it concisely and accurately.
You will receive an input string, and your response will be a summary of this information."""
def token_len(input: str) -> int:
    """Get the OpenAI token length of a string."""
    return len(ENCODING.encode(input))


def chunk(input: str) -> list:
    """Split the input into roughly equal slices that each fit the token budget."""
    input_tokens = token_len(input)
    count = math.ceil(input_tokens / MAX_TOKENS_SUMMARY)
    # Divide the string into `count` pieces of (almost) equal character length.
    k, m = divmod(len(input), count)
    chunks = [
        input[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(count)
    ]
    return chunks


def summarize(input: str) -> str:
    """Summarize each chunk, then re-summarize recursively until the result fits."""
    system_message = SystemMessagePromptTemplate.from_template(template=SUMMARY_SYS_MSG)
    human_message = HumanMessagePromptTemplate.from_template(template="Input: {input}")
    chunks = chunk(input=input)
    summary = ""
    for i in chunks:
        # Summarize each chunk independently and accumulate the results.
        prompt = ChatPromptTemplate(
            input_variables=["input"],
            messages=[system_message, human_message],
        )
        _input = prompt.format_prompt(input=i)
        output = SUMMARIZE_MODEL(_input.to_messages())
        summary += f"\n{output.content}"
    sum_tokens = token_len(input=summary)
    if sum_tokens > MAX_TOKENS_SUMMARY:
        # The combined chunk summaries are still too long: summarize them again.
        return summarize(input=summary)
    return summary
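A possible way to call it (the input file is just a placeholder):

with open("report.txt") as f:
    long_text = f.read()
print(summarize(input=long_text))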
@jake-landersweb @nezhazheng @ihorizons2022 What you have all been asking for is precisely how the langchain JS lib has implemented map_reduce. In fact, I switched from langchain JS to Python and was stumped to see that the map_reduce chain was so vastly different and that this was undocumented. In the JS version it was rather straightforward to reason about the map_reduce. I have messaged the authors asking about this design decision.
https://github.com/hwchase17/langchainjs/blob/89b1d8cced16be384e468d01e1a89d658f3f8f70/langchain/src/chains/combine_docs_chain.ts#L165
@jake-landersweb @scottrblock @SaschaHeyer @nezhazheng @adumont @neicras I am discussing with langchain team regarding overhauling the mapreduce implementation itself. My aim is to incorporate iterative mapping, like in the JS version or at least an equivalent into the Python version.
If you want to be able to set token_max until then, here is how you can do that :) I see this hasn't been suggested elsewhere and no one else has trawled through the chain's code to figure out how to pass in the kwargs for token_max, so here it is:
res = await chain(inputs={'input_documents': texts, 'token_max': 12000}, return_only_outputs=True)
@jake-landersweb @scottrblock @SaschaHeyer @nezhazheng @adumont @neicras https://github.com/hwchase17/langchain/pull/6994 This should solve most issues related to this. Also, token_max can now be passed to load_summarize_chain or as an initializing arg to the Reduce Chain. That should be merged in the next version bump.
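Once that is in your installed version, usage should look roughly like this (a sketch based on the description above; the token_max value is arbitrary, and llm/docs are the model and split documents from earlier in the thread):

from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    token_max=3000,  # upper bound on the combined summaries per reduce call
)
print(chain.run(docs))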
When will this be available?
The token_max suggestion above didn't work on my side; I still got the error:
InvalidRequestError: This model's maximum context length is 4096 tokens. However, your messages resulted in 4221 tokens....
Does anyone know how we can change the maximum context length? I am using gpt-4-32k, so it shouldn't be 4096.
I also encountered this problem, but I haven't found a solution yet. I only asked "1+1", and this error occurred. I want to know how to see the content of the prompt and completion when it says the length is exceeded.
Turn on debug mode, or locate 'llm.py', find the 'generate' function, and add a print there.
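For example, the global debug flag makes langchain print the full prompts and completions (a quick sketch; setting verbose=True on the chain or LLM is another option):

import langchain

langchain.debug = True  # log every prompt and completion
# or: langchain.verbose = True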
@Lucas-Li-XW @GeneralLHW https://github.com/langchain-ai/langchain/pull/7183
Hi @ugfly1210, I tried to set a breakpoint in the langchain source code, but the program never enters the source code and pauses there. I am using Anaconda notebooks.