model's maximum context length

Open neicras opened this issue 2 years ago • 19 comments

I have been running into openai.error.InvalidRequestError very often for exceeding the 4097-token maximum context length. Is there a module or best practice for managing the context length?

Quick example: I am adding the map_reduce summarize chain to the URL data loader and it's throwing that error:

import os

from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import UnstructuredURLLoader

os.environ["OPENAI_API_KEY"] = "*"

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

# Load the pages as documents
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

# Summarize with a map_reduce chain
llm = OpenAI(temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(data))

Error: openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 8356 tokens (8100 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

neicras avatar Mar 01 '23 03:03 neicras

How did you manage this issue?

AldawsariNLP avatar Mar 14 '23 20:03 AldawsariNLP

My case was a bit simpler: I was providing context plus questions and retrieving the answers, appending them to the message. In your case, check that the data you are passing is not repeating; if it isn't, split the data into chunks.

I think I confused my issue with yours; your case might be different from the one I faced.

106AbdulBasit avatar Mar 15 '23 08:03 106AbdulBasit

I have the same issue. I tried setting the chunk size for the splitter and it works:

text_splitter = CharacterTextSplitter(chunk_size=3000)
docs = WebBaseLoader(url).load_and_split(text_splitter)

hongweihao avatar Mar 17 '23 07:03 hongweihao

I get the same error. Any workaround?

anupam-tiwari avatar Mar 21 '23 22:03 anupam-tiwari

I get the same error.

joqk12345 avatar Mar 30 '23 03:03 joqk12345

I have the same issue when using PyPDFLoader with the load_and_split method.

tobegit3hub avatar Mar 31 '23 03:03 tobegit3hub

Same here, with CharacterTextSplitter. Without having looked at the source, my hunch is that the splitter only uses the chunk size as a reference and actually splits on the nearest line break or other separator character. So there does not seem to be any guarantee that a chunk fits in the context length.

If you have this issue with e.g. the CharacterTextSplitter, a workaround is to use the RecursiveCharacterTextSplitter and set a couple of separators that work for you. This reduces the chance of getting a chunk that doesn't fit, e.g.:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
    )

In general, reduce chunk size and set a separator that appears frequently.
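
To sanity-check that every chunk actually fits, you can measure the chunks with tiktoken before running the chain (a minimal sketch; the cl100k_base encoding and the 4097-token budget are assumptions taken from the error above, and `data` is the loaded documents from the first post):

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Pick the encoding that matches your model
encoding = tiktoken.get_encoding("cl100k_base")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, separators=[" ", ",", "\n"]
)
docs = text_splitter.split_documents(data)

# Flag any chunk whose token count would still blow past the model limit
for i, doc in enumerate(docs):
    n_tokens = len(encoding.encode(doc.page_content))
    if n_tokens > 4097 - 256:  # leave room for the completion
        print(f"chunk {i} is too large: {n_tokens} tokens")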

nilsec avatar Mar 31 '23 08:03 nilsec

I have the same error with CharacterTextSplitter

uabbas avatar Mar 31 '23 19:03 uabbas

I have the same issue with [Recursive]CharacterTextSplitter. Specifying custom separators didn't help.

nikkolasg avatar Apr 09 '23 20:04 nikkolasg

Try using:

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm, chain_type="stuff", retriever=db.as_retriever(), reduce_k_below_max_tokens=True,
)
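
For the retrieval-based QA use case, that pattern looks roughly like this end to end (a sketch only; the FAISS store, OpenAIEmbeddings, the example question, and the `texts` list of split documents are assumptions, not from the comment above):

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS

# `texts` is assumed to be a list of Documents produced by a text splitter
db = FAISS.from_documents(texts, OpenAIEmbeddings())

llm = OpenAI(temperature=0)
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    # Drop retrieved documents until the stuffed prompt fits under the model's token limit
    reduce_k_below_max_tokens=True,
)
print(chain({"question": "What happened on February 8?"}, return_only_outputs=True))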

ugfly1210 avatar Apr 13 '23 05:04 ugfly1210

@nilsec's answer worked for me. I switched to RecursiveCharacterTextSplitter:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000, chunk_overlap=0, separators=[" ", ",", "\n"]
    )

yachty66 avatar Apr 19 '23 22:04 yachty66

The problem is not with splitting the initial data into chunks of the right size. The problem is this: after creating summaries for each chunk, the chain collects them all into one prompt, and that combined prompt can exceed the limit. So bigger chunks help in some cases, but with enough split documents the combined prompt will still hit the token limit. The issue is still present. I think the correct fix is for load_summarize_chain to check the final request against a limit (based on a parameter like max_tokens or on the chosen model) and, if needed, split it again into several requests. The reduce step should probably use a recursive algorithm based on that limit. Example (a rough sketch of this recursive reduce follows after the steps):

  1. the final query that generates the summary from the sub-summaries exceeds the token limit
  2. it is split into two requests and a summary is generated for each of them (splitting again if needed)
  3. the final summary generation is retried
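
A minimal sketch of that recursive reduce, written outside the chain internals (the helper names, prompt, and token budget are illustrative assumptions, not LangChain API):

import tiktoken
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
encoding = tiktoken.get_encoding("cl100k_base")
TOKEN_MAX = 3000  # illustrative budget, kept below the 4097-token limit


def num_tokens(text: str) -> int:
    return len(encoding.encode(text))


def reduce_summaries(summaries: list) -> str:
    """Recursively collapse sub-summaries until they fit into one prompt."""
    combined = "\n".join(summaries)
    if len(summaries) == 1 or num_tokens(combined) <= TOKEN_MAX:
        # A single oversized summary would still need re-chunking; ignored here for brevity
        return llm(f"Write a concise summary of the following:\n{combined}")
    # Too big: split the sub-summaries in half, reduce each half, then reduce the results
    mid = len(summaries) // 2
    return reduce_summaries(
        [reduce_summaries(summaries[:mid]), reduce_summaries(summaries[mid:])]
    )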

unavailabl3 avatar Apr 25 '23 10:04 unavailabl3

The map_reduce step should in fact apply recursively, not just once in the hope that all the summaries will magically fit into one prompt. The concatenated summaries likely won't fit, and in that case the chain should apply again.

adumont avatar May 28 '23 15:05 adumont

Indeed, map_reduce should be applied recursively so the combined prompt fits within the 4096-token limit.

ihorizons2022 avatar May 30 '23 02:05 ihorizons2022

The process should at least give a better error trace, since it is hard to understand what is happening at first sight.

It could be a bit dangerous to have a recursive process here, in my view, even if you set a max_depth. A different feature could be, for example, to let you process the intermediate summaries (LLMs are very verbose; you can easily clean them up and save tokens) before passing them to the combined summary.
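
One way to approximate that today is to do the map step yourself, trim the intermediate summaries, and then combine them with a "stuff" chain (a sketch; `docs` is the list of split documents, and the prompt and trimming step are illustrative, not a built-in option):

from langchain.chains import LLMChain
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0)

# Map: summarize each chunk separately
map_prompt = PromptTemplate(
    input_variables=["text"],
    template="Write a concise summary of the following:\n{text}\nCONCISE SUMMARY:",
)
map_chain = LLMChain(llm=llm, prompt=map_prompt)
intermediate = [map_chain.run(text=d.page_content) for d in docs]

# Clean up the verbose intermediate summaries before combining them
cleaned = [Document(page_content=s.strip()) for s in intermediate]

# Combine: stuff the cleaned summaries into a single prompt
combine_chain = load_summarize_chain(llm, chain_type="stuff")
print(combine_chain.run(cleaned))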

luisroque avatar May 31 '23 10:05 luisroque

The recursion and its cost impact are of course relevant. It could be optional, but I still think it would benefit the use case.

The alternative today is an error. It simply doesn't work for big documents (or I missed something).

adumont avatar May 31 '23 11:05 adumont

I think the use case here is not processing very large documents (for that, you have to consider other approaches, such as a vector DB and semantic search) but mildly extending the token limits from where they are today. Even so, I agree that the error is not verbose, and you should get a bit more control over what the token extension looks like. Are we talking about ingesting a doc of ~40k tokens with an API limit of ~4k, i.e. a 10x increase? Or just 3x or 4x? I understand the complexity, since the models are not deterministic and can output very different things, especially considering that you can change the prompt. But that is why I suggested processing the intermediate summaries to save tokens, since the model outputs are very verbose.

luisroque avatar Jun 02 '23 16:06 luisroque

+1 for map_reduce supporting a recursive algorithm

nezhazheng avatar Jun 08 '23 01:06 nezhazheng

Any updates on that topic?

SaschaHeyer avatar Jun 13 '23 19:06 SaschaHeyer

Another case where this comes up is if you want to use load_summarize_chain with the more affordable text-curie-001 model (one of curie's use cases is faster/cheaper summarization). This constructs the final prompt correctly:

summary_chain = load_summarize_chain(llm=OpenAI(model="text-curie-001", temperature=0), chain_type="map_reduce", verbose=True)
poss_summary = summary_chain.run(pages).strip(" \n\t")

But it errors on use because:

This model's maximum context length is 2049 tokens, however you requested 3230 tokens (2974 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

I thought I might be able to pass token_max=1700 in as a kwarg, but I get

1 validation error for MapReduceDocumentsChain
kwargs
extra fields not permitted (type=value_error.extra)

scottrblock avatar Jun 20 '23 02:06 scottrblock

For those still struggling with this, here is some code I wrote to work around it for now. Note that this doesn't use Documents, though it could easily be converted to do so. Hope this helps people while the LangChain contributors work on this issue!

import math

import tiktoken
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

ENCODING = tiktoken.get_encoding("cl100k_base")
SUMMARIZE_MODEL = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0.2)
MAX_TOKENS_SUMMARY = 3000
SUMMARY_SYS_MSG = """You are SummaryGPT, a model designed to ingest content and summarize it concisely and accurately.
You will receive an input string, and your response will be a summary of this information."""


def token_len(input: str) -> int:
    """Get token length for openai"""
    return len(ENCODING.encode(input))

def chunk(input: str) -> list:
    """Split the input into roughly equal character slices based on its token count."""
    input_tokens = token_len(input)
    count = math.ceil(input_tokens / MAX_TOKENS_SUMMARY)
    k, m = divmod(len(input), count)
    chunks = [
        input[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(count)
    ]
    return chunks

def summarize(input: str) -> str:
    system_message = SystemMessagePromptTemplate.from_template(
        template=SUMMARY_SYS_MSG
    )
    human_message = HumanMessagePromptTemplate.from_template(
        template="Input: {input}"
    )

    chunks = chunk(input=input)

    summary = ""

    for i in chunks:
        prompt = ChatPromptTemplate(
            input_variables=["input"],
            messages=[system_message, human_message],
        )

        _input = prompt.format_prompt(input=i)
        output = SUMMARIZE_MODEL(_input.to_messages())
        summary += f"\n{output.content}"

    sum_tokens = token_len(input=summary)

    # If the concatenated chunk summaries are still too long, summarize them again recursively
    if sum_tokens > MAX_TOKENS_SUMMARY:
        return summarize(input=summary)

    return summary

jake-landersweb avatar Jun 28 '23 07:06 jake-landersweb

@jake-landersweb @nezhazheng @ihorizons2022 What you have all been asking for is precisely how the LangChain JS lib has implemented map_reduce. In fact, I switched from LangChain JS to Python and was stumped to see that the map_reduce chain was so vastly different, and that this was undocumented. In the JS version it was rather straightforward to reason about the map_reduce. I have messaged the authors asking about this design decision.

https://github.com/hwchase17/langchainjs/blob/89b1d8cced16be384e468d01e1a89d658f3f8f70/langchain/src/chains/combine_docs_chain.ts#L165

ShantanuNair avatar Jun 30 '23 00:06 ShantanuNair

@jake-landersweb @scottrblock @SaschaHeyer @nezhazheng @adumont @neicras I am discussing with langchain team regarding overhauling the mapreduce implementation itself. My aim is to incorporate iterative mapping, like in the JS version or at least an equivalent into the Python version.

If you want to be able to set token_max until then, here is how you can do that :) I see this hasn't been suggested elsewhere, and no one else has trawled through the chain's code to figure out how to pass in the kwargs for token_max, so here it is:

res = await chain(inputs={'input_documents': texts, 'token_max': 12000}, return_only_outputs=True)

ShantanuNair avatar Jul 05 '23 07:07 ShantanuNair

@jake-landersweb @scottrblock @SaschaHeyer @nezhazheng @adumont @neicras https://github.com/hwchase17/langchain/pull/6994 should solve most issues related to this. Also, token_max can now be passed to load_summarize_chain or as an initializing arg to the reduce chain. That should be merged in the next version bump.
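
Once that lands, usage would presumably look something like this, with `docs` being the list of split documents (a sketch based on the PR description only; check the merged API, as the exact parameter placement may differ):

from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# token_max bounds the combined prompt built from the intermediate summaries
chain = load_summarize_chain(llm, chain_type="map_reduce", token_max=3000)
print(chain.run(docs))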

ShantanuNair avatar Jul 05 '23 15:07 ShantanuNair

When will this be available?

JassimranK avatar Jul 05 '23 23:07 JassimranK

The token_max workaround above didn't work on my side; I still got the error InvalidRequestError: This model's maximum context length is 4096 tokens. However, your messages resulted in 4221 tokens....

Would anyone know how we can change the maximum context length? I am using gpt-4-32k, so the limit shouldn't be 4096.

Lucas-Li-XW avatar Jul 14 '23 15:07 Lucas-Li-XW

I also encountered this problem, but I haven't found a solution yet. I only asked "1+1" and this error occurred. I want to know how to see the content of the prompt and completion when it says the length is exceeded.

GeneralLHW avatar Aug 09 '23 01:08 GeneralLHW

Turn on debug mode, or locate 'llm.py', find the 'generate' func, and add a print there.

ugfly1210 avatar Aug 09 '23 01:08 ugfly1210

@Lucas-Li-XW @GeneralLHW https://github.com/langchain-ai/langchain/pull/7183

ShantanuNair avatar Aug 09 '23 04:08 ShantanuNair

Hi @ugfly1210, I tried setting a breakpoint in the LangChain source code, but the program will not enter the source code and pause. I am using Anaconda notebooks.

GeneralLHW avatar Aug 09 '23 10:08 GeneralLHW