model's maximum context length
I have very often been running into openai.error.InvalidRequestError for exceeding the 4097-token maximum context length. Is there a module or best practice to manage the context length?
Quick example: I am adding the map_reduce summarize chain to the URL data loader and it's throwing that error:
import os

from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import UnstructuredURLLoader

os.environ["OPENAI_API_KEY"] = "*"

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

# Load the raw page contents (each URL becomes one large Document).
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

# Summarize with a map_reduce chain; this is where the token limit is hit.
llm = OpenAI(temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(data))
Error: openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 8356 tokens (8100 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
How did you manage this issue?
My case was a bit simpler: I was providing context plus questions and retrieving the answers, appending the context to the message. In your case, check that the data you are passing is not repeating; if it isn't, then pass the data in chunks.
On second thought, I may be confusing my issue with yours; your case might be different from the one I faced.
I have the same issue. I tried setting the chunk size for the splitter and it works:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=3000)
docs = WebBaseLoader(url).load_and_split(text_splitter)
I get the same error. Any workaround?
I get the same error.
I have the same issue when using PyPDFLoader with load_and_split method.
Same, with CharacterTextSplitter. Without having looked at the source, my hunch is that the chunking only uses the chunk size as a reference but actually chunks on the nearest line break or other separator character, so there does not seem to be any guarantee that a chunk fits in the context length.
If you have this issue with e.g. the CharacterTextSplitter, a workaround is to use the RecursiveCharacterTextSplitter and set a couple of separators that work for you. This reduces the chance of having a chunk that doesn't fit, e.g.:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
)
In general, reduce chunk size and set a separator that appears frequently.
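To make that concrete, here is a minimal sketch (not from the thread) that applies the splitter to the data and chain from the original example; the cl100k_base encoding is only an assumption used for a rough token count:

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Rough token counter, just for a sanity check (encoding choice is an assumption).
enc = tiktoken.get_encoding("cl100k_base")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
)
docs = text_splitter.split_documents(data)  # `data` from the UnstructuredURLLoader example above

# Verify that every chunk is well below the 4097-token context window.
print(max(len(enc.encode(d.page_content)) for d in docs))

print(chain.run(docs))  # `chain` is the map_reduce summarize chain from the question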
I have the same error with CharacterTextSplitter
I have the same issue with [Recursive]CharacterTextSplitter. Specifying custom separators didn't help.
Try using:
chain = RetrievalQAWithSourcesChain.from_chain_type(llm, chain_type="stuff", retriever=db.as_retriever(), reduce_k_below_max_tokens=True,)
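For context, db in that snippet is a vector store retriever. A minimal sketch of one possible setup (the FAISS/OpenAIEmbeddings choice and the reuse of data and llm from the original example are assumptions, not part of the suggestion above):

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Split the loaded pages and index them so only the most relevant chunks are retrieved.
docs = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(data)
db = FAISS.from_documents(docs, OpenAIEmbeddings())

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    reduce_k_below_max_tokens=True,  # drop the lowest-ranked docs until the prompt fits
)
print(chain({"question": "What happened on February 8?"}))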
@nilsec's answer worked for me. I have now switched to RecursiveCharacterTextSplitter:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=4000, chunk_overlap=0, separators=[" ", ",", "\n"]
)
The problem is not with splitting the initial data into chunks of the right size.
The problem is the following: after creating summaries for each chunk, the chain collects them into one prompt, and that combined prompt can itself exceed the limit. So bigger chunks can help in some cases, but if there are many split documents, at some point it will still hit the token limit.
The issue is still present. I think the correct solution is that load_summarize_chain should check the final combine request against a limit, based on a parameter like max_tokens or on the chosen model, and split it again into a couple of requests if needed.
Perhaps the reduce step should use a recursive algorithm based on the limit size.
Example:
- the last query, which generates the final summary from the sub-summaries, exceeds the token limit
- it should be split into two requests and a summary produced for each of them (or split again if needed)
- then try again to generate the final summary
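A rough sketch of that recursive reduce idea (an illustration only, not langchain's implementation; summarize_batch is a hypothetical function that asks the LLM to summarize a piece of text):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TOKEN_LIMIT = 3000  # stay safely below the 4097-token context window


def reduce_summaries(summaries, summarize_batch):
    combined = "\n".join(summaries)
    if len(enc.encode(combined)) <= TOKEN_LIMIT:
        # Everything fits: one final call produces the overall summary.
        return summarize_batch(combined)
    if len(summaries) == 1:
        # A single over-long summary: split the text itself in half.
        text = summaries[0]
        summaries = [text[: len(text) // 2], text[len(text) // 2 :]]
    # Otherwise reduce each half separately, then combine the two results.
    mid = len(summaries) // 2
    left = reduce_summaries(summaries[:mid], summarize_batch)
    right = reduce_summaries(summaries[mid:], summarize_batch)
    return reduce_summaries([left, right], summarize_batch)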
The map_reduce chain should in fact apply recursively, not only once and then hope all the summaries will magically fit into one prompt. It's likely the concatenated summaries won't, and in that case the chain should apply again.
Indeed, map_reduce should be applied recursively so the combined summaries fit within the 4096-token limit.
The process should at least give a better trace of the error since it is hard to understand what is happening at first sight.
It could be a bit dangerous to have a recursive process here in my view, even if you set a max_depth. A different feature could be, for example, to allow you to process the intermediate summaries (LLMs are very verbose, so you can easily clean them up and save tokens) before passing them to the combined summary.
The recursiveness and cost impacts are of course relevant. It could be optional, but I still think it would benefit this use case.
The alternative today is an error. It simply doesn't work for big documents (or I missed something).
I think the use case here is not to process very large documents (for that, you have to consider other approaches such as using a vector DB and semantic search) but to mildly extend the token limits from where they are today. Even so, I agree that the error is not verbose, and you should get a bit more control over what the token extension could look like. Are we talking about ingesting a doc of ~40k tokens into an API with a limit of ~4k, so a 10x increase? Or just 3x or 4x? I understand the complexity, since the models are not deterministic and can output very different things, especially considering that you can change the prompt. But that is why I shared the possibility of processing the intermediate summaries to save tokens, since the outputs of the models are very verbose.
+1 for map_reduce supporting a recursive algorithm
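Illustrative only: one possible shape of that intermediate-summary processing, as a hypothetical condense() hook that trims verbose lead-ins and caps each map output at a token budget before the combine step (the phrase list and budget are arbitrary assumptions):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
VERBOSE_PREFIXES = ("In summary,", "Overall,", "This document discusses")


def condense(summary: str, max_tokens: int = 300) -> str:
    text = summary.strip()
    for prefix in VERBOSE_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):].lstrip()
    # Hard cap on the token count of each intermediate summary.
    return enc.decode(enc.encode(text)[:max_tokens])


intermediate_summaries = ["...", "..."]  # outputs of the map step
combined_input = "\n".join(condense(s) for s in intermediate_summaries)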
Any updates on that topic?
Another case this comes up is if you want to use load_summarize_chain, but with the more affordable text-curie-001 model (one of curie's use cases is faster/cheaper summarization). This constructs the final prompt correctly:
summary_chain = load_summarize_chain(llm=OpenAI(model="text-curie-001", temperature=0), chain_type="map_reduce", verbose=True)
poss_summary = summary_chain.run(pages).strip(" \n\t")
But it errors on use because:
This model's maximum context length is 2049 tokens, however you requested 3230 tokens (2974 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
I thought I might be able to pass token_max=1700 in as a kwarg, but I get
1 validation error for MapReduceDocumentsChain
kwargs
extra fields not permitted (type=value_error.extra)
For those who are still struggling with this, here is some code I wrote to get around it for now. Note that this doesn't use Documents, though it could easily be converted to do so. Hope this helps people while the langchain contributors work on this issue!
import math

import tiktoken
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
ChatPromptTemplate,
SystemMessagePromptTemplate,
HumanMessagePromptTemplate,
)
ENCODING = tiktoken.get_encoding("cl100k_base")
SUMMARIZE_MODEL = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0.2)
MAX_TOKENS_SUMMARY = 3000
SUMMARY_SYS_MSG = """You are SummaryGPT, a model designed to ingest content and summarize it concisely and accurately.
You will receive an input string, and your response will be a summary of this information."""
def token_len(input: str) -> int:
    """Get the OpenAI token length of a string."""
    return len(ENCODING.encode(input))


def chunk(input: str) -> list:
    """Split the input into roughly equal slices that each fit the token budget."""
    input_tokens = token_len(input)
    count = math.ceil(input_tokens / MAX_TOKENS_SUMMARY)
    # Divide the string into `count` pieces of (almost) equal character length.
    k, m = divmod(len(input), count)
    chunks = [
        input[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(count)
    ]
    return chunks


def summarize(input: str) -> str:
    """Summarize each chunk, then re-summarize recursively until the result fits."""
    system_message = SystemMessagePromptTemplate.from_template(template=SUMMARY_SYS_MSG)
    human_message = HumanMessagePromptTemplate.from_template(template="Input: {input}")
    chunks = chunk(input=input)
    summary = ""
    for i in chunks:
        # Summarize each chunk independently and accumulate the results.
        prompt = ChatPromptTemplate(
            input_variables=["input"],
            messages=[system_message, human_message],
        )
        _input = prompt.format_prompt(input=i)
        output = SUMMARIZE_MODEL(_input.to_messages())
        summary += f"\n{output.content}"
    sum_tokens = token_len(input=summary)
    if sum_tokens > MAX_TOKENS_SUMMARY:
        # The combined chunk summaries are still too long: summarize them again.
        return summarize(input=summary)
    return summary
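A possible way to call it (the input file is just a placeholder):

with open("report.txt") as f:
    long_text = f.read()
print(summarize(input=long_text))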
@jake-landersweb @nezhazheng @ihorizons2022 What you have all been asking for is precisely how the langchain JS lib has implemented map_reduce. In fact, I switched from langchain JS to Python and was stumped to see that the map_reduce chain was so vastly different and that this was undocumented. In the JS version it was rather straightforward to reason about the map_reduce. I have messaged the authors asking about this design decision.
https://github.com/hwchase17/langchainjs/blob/89b1d8cced16be384e468d01e1a89d658f3f8f70/langchain/src/chains/combine_docs_chain.ts#L165
@jake-landersweb @scottrblock @SaschaHeyer @nezhazheng @adumont @neicras I am discussing with langchain team regarding overhauling the mapreduce implementation itself. My aim is to incorporate iterative mapping, like in the JS version or at least an equivalent into the Python version.
If you want to be able to set token_max until then, here is how you can do that :) I see this hasn't been suggested elsewhere and no one else has trawled through the chain's code to figure out how to pass in the kwargs for token_max, so here it is:
res = await chain(inputs={'input_documents': texts, 'token_max': 12000}, return_only_outputs=True)
@jake-landersweb @scottrblock @SaschaHeyer @nezhazheng @adumont @neicras https://github.com/hwchase17/langchain/pull/6994 This should solve most issues related to this. Also, token_max can now be passed to load_summarize_chain or as an initializing arg to the Reduce Chain. That should be merged in the next version bump.
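Once that is in your installed version, usage should look roughly like this (a sketch based on the description above; the token_max value is arbitrary, and llm/docs are the model and split documents from earlier in the thread):

from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    token_max=3000,  # upper bound on the combined summaries per reduce call
)
print(chain.run(docs))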
When will this be available?
The token_max suggestion above didn't work on my side; I still got the error:
InvalidRequestError: This model's maximum context length is 4096 tokens. However, your messages resulted in 4221 tokens....
Does anyone know how we can change the maximum context length? I am using gpt-4-32k, so it shouldn't be 4096.
I also encountered this problem, but I haven't found a solution yet. I only asked "1+1", and this error occurred. I want to know how to see the content of the prompt and completion when it says the length is exceeded.
Turn on debug mode, or locate 'llm.py', find the 'generate' function, and add a print there.
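For example, the global debug flag makes langchain print the full prompts and completions (a quick sketch; setting verbose=True on the chain or LLM is another option):

import langchain

langchain.debug = True  # log every prompt and completion
# or: langchain.verbose = True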
@Lucas-Li-XW @GeneralLHW https://github.com/langchain-ai/langchain/pull/7183
Hi @ugfly1210, I tried to set a breakpoint in the langchain source code, but the program never enters the source code and pauses there. I am using Anaconda notebooks.