`load_qa_chain` with `map_reduce` results in "Token indices sequence length" error
Whenever I run code like
chain = load_qa_chain(llm=flan_t5_xxl, chain_type="map_reduce")
answer = chain({"input_documents": split_docs, "question": query}, return_only_outputs=True)
I first get a warning:
Token indices sequence length is longer than the specified maximum length for this model
followed by an error, again about there being too many tokens.
Some observations:
- The error occurs no matter what the document input is: even if there is only a single input document of a few characters.
- It doesn't happen when the chain_type is map_rerank.
- It doesn't happen when using load_summarize_chain and map_reduce together.
Is there a fix for this? I thought about modifying the tokenizer config but I can't find a way to do that except with locally-loaded models, and to save RAM I prefer to use the model remotely (is that even a practical approach long-term?).
I found a note by @hwchase17 on adding a token_max parameter, but I still get errors.
I also tried swapping ChromaDB for FAISS but still got this error.
I notice in the past this type of error was also encountered by @Kamalabot, @rohan-uiuc, @nqbao, @vertinski, @happysalada, @yunyu, @karndeepsingh, and @yu-iskw.
I wonder if anybody has come to a good understanding of this token length limit and managed to overcome it with a non-OpenAI LLM used remotely via HuggingFaceHub?
I think I figured it out. The QA chain has some default text for the combine prompt which is too long for the flan_t5_xxl LLM on HuggingFace. This can be overcome by passing in a custom combine_prompt to load_qa_chain.
The error messages from LangChain could be better here.
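Roughly what I mean, as a sketch (the prompt wording and model_kwargs here are just illustrative, and it assumes HUGGINGFACEHUB_API_TOKEN is set in the environment):

```python
# Sketch of the workaround: replace the long default prompts with short ones.
# The prompt text below is illustrative, not the LangChain defaults.
from langchain.llms import HuggingFaceHub
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

question_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following portion of a long document to see if any of it is "
        "relevant to the question.\n{context}\nQuestion: {question}\nRelevant text, if any:"
    ),
)

# A much shorter combine prompt than the default (which includes long few-shot examples).
combine_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Given the following extracted parts of a long document and a question, "
        "create a final answer.\nQUESTION: {question}\n=========\n{summaries}\n"
        "=========\nFINAL ANSWER:"
    ),
)

flan_t5_xxl = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.5, "max_length": 256},
)

chain = load_qa_chain(
    llm=flan_t5_xxl,
    chain_type="map_reduce",
    question_prompt=question_prompt,
    combine_prompt=combine_prompt,
)
answer = chain({"input_documents": split_docs, "question": query}, return_only_outputs=True)
```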
I think it was sending in about 1500 tokens by default and the flan T5 limit is 1024. But I'm not sure, it's not clear to me from the model card on Hugging Face. Just curious, can anybody advise me on where I can quickly see this kind of thing?
Flan-T5 models were trained on 2k-token input windows and 512-token output windows, so they should be able to manage pretty long in-context sequences. Are you using the HuggingFace Inference API, or is your model local?
I run code like
chain = load_qa_chain(llm=chatglm, chain_type="map_reduce", return_map_steps=True)
chain({"input_documents": search_docs_Documents, "question": query}, return_only_outputs=True)
I first get these warnings:
Token indices sequence length is longer than the specified maximum sequence length for this model (4600 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1588 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2386 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3156 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3020 > 1024). Running this sequence through the model will result in indexing errors
and then this error:
/tmp/ipykernel_262002/49478104.py:4 in <module> │
│ │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_262002/49478104.py' │
│ │
│ /tmp/ipykernel_262002/14951549.py:11 in answer_docs │
│ │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_262002/14951549.py' │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/base.py:116 in │
│ __call__ │
│ │
│ 113 │ │ │ outputs = self._call(inputs) │
│ 114 │ │ except (KeyboardInterrupt, Exception) as e: │
│ 115 │ │ │ self.callback_manager.on_chain_error(e, verbose=self.verbose) │
│ ❱ 116 │ │ │ raise e │
│ 117 │ │ self.callback_manager.on_chain_end(outputs, verbose=self.verbose) │
│ 118 │ │ return self.prep_outputs(inputs, outputs, return_only_outputs) │
│ 119 │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/base.py:113 in │
│ __call__ │
│ │
│ 110 │ │ │ verbose=self.verbose, │
│ 111 │ │ ) │
│ 112 │ │ try: │
│ ❱ 113 │ │ │ outputs = self._call(inputs) │
│ 114 │ │ except (KeyboardInterrupt, Exception) as e: │
│ 115 │ │ │ self.callback_manager.on_chain_error(e, verbose=self.verbose) │
│ 116 │ │ │ raise e │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_document │
│ s/base.py:75 in _call │
│ │
│ 72 │ │ docs = inputs[self.input_key] │
│ 73 │ │ # Other keys are assumed to be needed for LLM prediction │
│ 74 │ │ other_keys = {k: v for k, v in inputs.items() if k != self.input_key} │
│ ❱ 75 │ │ output, extra_return_dict = self.combine_docs(docs, **other_keys) │
│ 76 │ │ extra_return_dict[self.output_key] = output │
│ 77 │ │ return extra_return_dict │
│ 78 │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_document │
│ s/map_reduce.py:143 in combine_docs │
│ │
│ 140 │ │ │ # FYI - this is parallelized and so it is fast. │
│ 141 │ │ │ [{**{self.document_variable_name: d.page_content}, **kwargs} for d in docs] │
│ 142 │ │ ) │
│ ❱ 143 │ │ return self._process_results(results, docs, token_max, **kwargs) │
│ 144 │ │
│ 145 │ async def acombine_docs( │
│ 146 │ │ self, docs: List[Document], **kwargs: Any │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_document │
│ s/map_reduce.py:179 in _process_results │
│ │
│ 176 │ │ │ return self._collapse_chain.run(input_documents=docs, **kwargs) │
│ 177 │ │ │
│ 178 │ │ while num_tokens is not None and num_tokens > token_max: │
│ ❱ 179 │ │ │ new_result_doc_list = _split_list_of_docs( │
│ 180 │ │ │ │ result_docs, length_func, token_max, **kwargs │
│ 181 │ │ │ ) │
│ 182 │ │ │ result_docs = [] │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_document │
│ s/map_reduce.py:36 in _split_list_of_docs │
│ │
│ 33 │ │ │ │ │ " we cannot handle this." │
│ 34 │ │ │ │ ) │
│ 35 │ │ │ if len(_sub_result_docs) == 2: │
│ ❱ 36 │ │ │ │ raise ValueError( │
│ 37 │ │ │ │ │ "A single document was so long it could not be combined " │
│ 38 │ │ │ │ │ "with another document, we cannot handle this." │
│ 39 │ │ │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: A single document was so long it could not be combined with another document, we cannot handle this.
Flan-T5 models were trained on 2k-token input windows and 512-token output windows, so they should be able to manage pretty long in-context sequences. Are you using the HuggingFace Inference API, or is your model local?
The HuggingFace Inference API (remote).
ValueError: A single document was so long it could not be combined with another document, we cannot handle this.
Looks like the same problem I had. I think the solution is to modify the prompts so that not too much content flows through in the intermediate steps, assuming the initial splitting is right.
I was struggling with this all day, and I know what the problem is, but I'm not sure whether it's intentional.
The prompt templates for map_reduce include a big block of example text that I don't think is supposed to be there (it uses text from the state_of_the_union sample). Check these two files:
https://github.com/hwchase17/langchain/blob/master/langchain/chains/qa_with_sources/map_reduce_prompt.py and https://github.com/hwchase17/langchain/blob/master/langchain/chains/question_answering/map_reduce_prompt.py
It says you are out of tokens because of the very long example questions and answers in the combine_prompt_template variable.
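You can see how much of the budget that default eats on its own before any document text is added (a quick sketch; the count will vary by tokenizer):

```python
# Quick sketch: count how many tokens the default combine prompt uses on its
# own, before any document text is substituted in (flan-t5 tokenizer shown).
from transformers import AutoTokenizer
from langchain.chains.question_answering import map_reduce_prompt

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
print(len(tokenizer.encode(map_reduce_prompt.COMBINE_PROMPT.template)))
```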
Yes @carlgira, I found that too. I was suggesting the workaround of passing in a custom combine_prompt to load_qa_chain.
@oranda @carlgira how would you pass a custom combine_prompt? I am using RetrievalQAWithSourcesChain and face the same issue:
chain = RetrievalQAWithSourcesChain.from_chain_type(llm, chain_type="map_reduce", retriever=db_instructEmbedd.as_retriever(), verbose=True)
@oranda @carlgira how would you pass a custom combine_prompt? I am using RetrievalQAWithSourcesChain and face the same issue
this should work per https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa_with_sources.html#chain-type
qa_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce", question_prompt=QUESTION_PROMPT, combine_prompt=COMBINE_PROMPT, verbose=True)
chain = RetrievalQAWithSourcesChain(combine_documents_chain=qa_chain, retriever=db_instructEmbedd.as_retriever())
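Spelled out with imports, it looks roughly like this (a sketch; the two prompt templates are placeholder stand-ins for your own shorter QUESTION_PROMPT and COMBINE_PROMPT):

```python
# Sketch with the imports filled in; the prompt templates are placeholders,
# not the LangChain defaults.
from langchain.prompts import PromptTemplate
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains import RetrievalQAWithSourcesChain

QUESTION_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following portion of a long document to see if any of it is "
        "relevant to the question.\n{context}\nQuestion: {question}\nRelevant text, if any:"
    ),
)

# The sources chain fills in {summaries} and {question}; asking for a "SOURCES"
# section lets the chain split the answer from its sources.
COMBINE_PROMPT = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Given the following extracted parts of a long document and a question, "
        'create a final answer with references ("SOURCES").\n'
        "QUESTION: {question}\n=========\n{summaries}\n=========\nFINAL ANSWER:"
    ),
)

qa_chain = load_qa_with_sources_chain(
    llm,
    chain_type="map_reduce",
    question_prompt=QUESTION_PROMPT,
    combine_prompt=COMBINE_PROMPT,
    verbose=True,
)
chain = RetrievalQAWithSourcesChain(
    combine_documents_chain=qa_chain,
    retriever=db_instructEmbedd.as_retriever(),
)
```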
I am not convinced this is an input token limit at all. I am using PaLM, which has an 8k context window, more than ample for my text chunks, but I get the same warning.
Did you try changing the chain_type to stuff? It's the simplest one. By contrast, map_reduce first collects a result for each document and then combines them and sends that to the LLM. Take a look at the docs: https://python.langchain.com/docs/modules/chains/document/map_reduce
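For example (a minimal sketch reusing the variables from your earlier snippet):

```python
# Minimal sketch of the "stuff" chain: all documents are concatenated into a
# single prompt, so it only works if the combined text fits the model's window.
from langchain.chains.question_answering import load_qa_chain

stuff_chain = load_qa_chain(llm=chatglm, chain_type="stuff")
stuff_chain({"input_documents": search_docs_Documents, "question": query}, return_only_outputs=True)
```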
The warnings come from the tokenizer base class in the transformers library: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3602
def _eventual_warn_about_too_long_sequence(self, ids: List[int], max_length: Optional[int], verbose: bool):
    """
    Depending on the input and internal state we might trigger a warning about a sequence that is too long for its
    corresponding model

    Args:
        ids (`List[str]`): The ids produced by the tokenization
        max_length (`int`, *optional*): The max_length desired (does not trigger a warning if it is set)
        verbose (`bool`): Whether or not to print more information and warnings.
    """
    if max_length is None and len(ids) > self.model_max_length and verbose:
        if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
            logger.warning(
                "Token indices sequence length is longer than the specified maximum sequence length "
                f"for this model ({len(ids)} > {self.model_max_length}). Running this sequence through the model "
                "will result in indexing errors"
            )
        self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True
It is just a default value from transformers, so I guess it can be safely ignored for now.
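If you want to see, or raise, the limit that warning compares against, it is the tokenizer's model_max_length attribute (a sketch for a locally loaded tokenizer; 512 is just the typical default for the T5 tokenizers):

```python
# Where the warning's limit comes from: the tokenizer's model_max_length.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
print(tokenizer.model_max_length)  # typically 512 for T5 tokenizers

# If you know the model actually handles longer inputs, you can raise this to
# silence the warning (it does not change what the model itself supports).
tokenizer.model_max_length = 2048
```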
Hi, @oranda. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue is related to running load_qa_chain with map_reduce resulting in a "Token indices sequence length" error. You mentioned that the default combine prompt for the QA chain is too long for the flan_t5_xxl LLM on HuggingFace. As a workaround, you suggested passing a custom combine_prompt to load_qa_chain. There have been discussions among other users about possible solutions, such as modifying the prompt templates or using a different chain type.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!