`load_qa_chain` with `map_reduce` results in "Token indices sequence length" error
Whenever I run code like
chain = load_qa_chain(llm=flan_t5_xxl, chain_type="map_reduce")
answer = chain({"input_documents": split_docs, "question": query}, return_only_outputs=True)
I first get a warning:
Token indices sequence length is longer than the specified maximum length for this model
followed by an error, again about there being too many tokens.
Some observations:
- The error occurs no matter what the document input is: even if there is only a single input document of a few characters.
- It doesn't happen when the chain_type is map_rerank.
- It doesn't happen when using load_summarize_chain and map_reduce together.
Is there a fix for this? I thought about modifying the tokenizer config but I can't find a way to do that except with locally-loaded models, and to save RAM I prefer to use the model remotely (is that even a practical approach long-term?).
I found a note by @hwchase17 on adding a token_max parameter, but I still get errors.
I also tried swapping ChromaDB for FAISS but still got this error.
I notice in the past this type of error was also encountered by @Kamalabot, @rohan-uiuc, @nqbao, @vertinski, @happysalada, @yunyu, @karndeepsingh, and @yu-iskw.
I wonder if anybody has come to a good understanding of this token length limit and managed to overcome it with a non-OpenAI LLM used remotely via HuggingFaceHub?
I think I figured it out. The QA chain has some default text for the combine prompt which is too long for the flan_t5_xxl LLM on HuggingFace. This can be overcome by passing in a custom combine_prompt to load_qa_chain.
The error messages from LangChain could be better here.
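Roughly what I mean, as a sketch (the prompt wording and model_kwargs here are just illustrative, and it assumes HUGGINGFACEHUB_API_TOKEN is set in the environment):

```python
# Sketch of the workaround: replace the long default prompts with short ones.
# The prompt text below is illustrative, not the LangChain defaults.
from langchain.llms import HuggingFaceHub
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

question_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following portion of a long document to see if any of it is "
        "relevant to the question.\n{context}\nQuestion: {question}\nRelevant text, if any:"
    ),
)

# A much shorter combine prompt than the default (which includes long few-shot examples).
combine_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Given the following extracted parts of a long document and a question, "
        "create a final answer.\nQUESTION: {question}\n=========\n{summaries}\n"
        "=========\nFINAL ANSWER:"
    ),
)

flan_t5_xxl = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.5, "max_length": 256},
)

chain = load_qa_chain(
    llm=flan_t5_xxl,
    chain_type="map_reduce",
    question_prompt=question_prompt,
    combine_prompt=combine_prompt,
)
answer = chain({"input_documents": split_docs, "question": query}, return_only_outputs=True)
```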
I think it was sending in about 1500 tokens by default and the flan T5 limit is 1024. But I'm not sure, it's not clear to me from the model card on Hugging Face. Just curious, can anybody advise me on where I can quickly see this kind of thing?
Flan-T5 models were trained on 2k-token input windows and 512-token output windows, so they should be able to manage pretty long in-context sequences. Are you using the HuggingFace Inference API, or is your model local?
I run code like
chain = load_qa_chain(llm=chatglm, chain_type="map_reduce", return_map_steps=True)
chain({"input_documents": search_docs_Documents, "question": query}, return_only_outputs=True)
I first get these warnings:
Token indices sequence length is longer than the specified maximum sequence length for this model (4600 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1588 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2386 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3156 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3020 > 1024). Running this sequence through the model will result in indexing errors
and then this error:
/tmp/ipykernel_262002/49478104.py:4 in <module> │
│ │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_262002/49478104.py' │
│ │
│ /tmp/ipykernel_262002/14951549.py:11 in answer_docs │
│ │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_262002/14951549.py' │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/base.py:116 in │
│ __call__ │
│ │
│ 113 │ │ │ outputs = self._call(inputs) │
│ 114 │ │ except (KeyboardInterrupt, Exception) as e: │
│ 115 │ │ │ self.callback_manager.on_chain_error(e, verbose=self.verbose) │
│ ❱ 116 │ │ │ raise e │
│ 117 │ │ self.callback_manager.on_chain_end(outputs, verbose=self.verbose) │
│ 118 │ │ return self.prep_outputs(inputs, outputs, return_only_outputs) │
│ 119 │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/base.py:113 in │
│ __call__ │
│ │
│ 110 │ │ │ verbose=self.verbose, │
│ 111 │ │ ) │
│ 112 │ │ try: │
│ ❱ 113 │ │ │ outputs = self._call(inputs) │
│ 114 │ │ except (KeyboardInterrupt, Exception) as e: │
│ 115 │ │ │ self.callback_manager.on_chain_error(e, verbose=self.verbose) │
│ 116 │ │ │ raise e │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_document │
│ s/base.py:75 in _call │
│ │
│ 72 │ │ docs = inputs[self.input_key] │
│ 73 │ │ # Other keys are assumed to be needed for LLM prediction │
│ 74 │ │ other_keys = {k: v for k, v in inputs.items() if k != self.input_key} │
│ ❱ 75 │ │ output, extra_return_dict = self.combine_docs(docs, **other_keys) │
│ 76 │ │ extra_return_dict[self.output_key] = output │
│ 77 │ │ return extra_return_dict │
│ 78 │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_document │
│ s/map_reduce.py:143 in combine_docs │
│ │
│ 140 │ │ │ # FYI - this is parallelized and so it is fast. │
│ 141 │ │ │ [{**{self.document_variable_name: d.page_content}, **kwargs} for d in docs] │
│ 142 │ │ ) │
│ ❱ 143 │ │ return self._process_results(results, docs, token_max, **kwargs) │
│ 144 │ │
│ 145 │ async def acombine_docs( │
│ 146 │ │ self, docs: List[Document], **kwargs: Any │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_document │
│ s/map_reduce.py:179 in _process_results │
│ │
│ 176 │ │ │ return self._collapse_chain.run(input_documents=docs, **kwargs) │
│ 177 │ │ │
│ 178 │ │ while num_tokens is not None and num_tokens > token_max: │
│ ❱ 179 │ │ │ new_result_doc_list = _split_list_of_docs( │
│ 180 │ │ │ │ result_docs, length_func, token_max, **kwargs │
│ 181 │ │ │ ) │
│ 182 │ │ │ result_docs = [] │
│ │
│ /home/hysz/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_document │
│ s/map_reduce.py:36 in _split_list_of_docs │
│ │
│ 33 │ │ │ │ │ " we cannot handle this." │
│ 34 │ │ │ │ ) │
│ 35 │ │ │ if len(_sub_result_docs) == 2: │
│ ❱ 36 │ │ │ │ raise ValueError( │
│ 37 │ │ │ │ │ "A single document was so long it could not be combined " │
│ 38 │ │ │ │ │ "with another document, we cannot handle this." │
│ 39 │ │ │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: A single document was so long it could not be combined with another document, we cannot handle this.
Flan-T5 models were trained on 2k-token input windows and 512-token output windows, so they should be able to manage pretty long in-context sequences. Are you using the HuggingFace Inference API, or is your model local?
The HuggingFace Inference API (remote).
ValueError: A single document was so long it could not be combined with another document, we cannot handle this.
Looks like the same problem I had. I think the solution is to modify the prompts so that not too much content flows through in the intermediate steps, assuming the initial splitting is right.
I was struggling with this all day, and I know what the problem is, but I'm not sure whether it's intentional.
The prompt templates for map_reduce include a big block of example text that I don't think is supposed to be there (it uses text from the state_of_the_union sample). Check these two files:
https://github.com/hwchase17/langchain/blob/master/langchain/chains/qa_with_sources/map_reduce_prompt.py and https://github.com/hwchase17/langchain/blob/master/langchain/chains/question_answering/map_reduce_prompt.py
It says you are out of tokens because of the very long example questions and answers in the combine_prompt_template variable.
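You can see how much of the budget that default eats on its own before any document text is added (a quick sketch; the count will vary by tokenizer):

```python
# Quick sketch: count how many tokens the default combine prompt uses on its
# own, before any document text is substituted in (flan-t5 tokenizer shown).
from transformers import AutoTokenizer
from langchain.chains.question_answering import map_reduce_prompt

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
print(len(tokenizer.encode(map_reduce_prompt.COMBINE_PROMPT.template)))
```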
Yes @carlgira, I found that too. I was suggesting the workaround of passing in a custom combine_prompt to load_qa_chain.
@oranda @carlgira how would you pass a custom combine_prompt? I am using RetrievalQAWithSourcesChain and face the same issue:
chain = RetrievalQAWithSourcesChain.from_chain_type(llm, chain_type="map_reduce", retriever=db_instructEmbedd.as_retriever(), verbose=True)
@oranda @carlgira how would you pass a custom combine_prompt? I am using RetrievalQAWithSourcesChain and face the same issue
this should work per https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa_with_sources.html#chain-type
qa_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce", question_prompt=QUESTION_PROMPT, combine_prompt=COMBINE_PROMPT, verbose=True)
chain = RetrievalQAWithSourcesChain(combine_documents_chain=qa_chain, retriever=db_instructEmbedd.as_retriever())
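Spelled out with imports, it looks roughly like this (a sketch; the two prompt templates are placeholder stand-ins for your own shorter QUESTION_PROMPT and COMBINE_PROMPT):

```python
# Sketch with the imports filled in; the prompt templates are placeholders,
# not the LangChain defaults.
from langchain.prompts import PromptTemplate
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains import RetrievalQAWithSourcesChain

QUESTION_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following portion of a long document to see if any of it is "
        "relevant to the question.\n{context}\nQuestion: {question}\nRelevant text, if any:"
    ),
)

# The sources chain fills in {summaries} and {question}; asking for a "SOURCES"
# section lets the chain split the answer from its sources.
COMBINE_PROMPT = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Given the following extracted parts of a long document and a question, "
        'create a final answer with references ("SOURCES").\n'
        "QUESTION: {question}\n=========\n{summaries}\n=========\nFINAL ANSWER:"
    ),
)

qa_chain = load_qa_with_sources_chain(
    llm,
    chain_type="map_reduce",
    question_prompt=QUESTION_PROMPT,
    combine_prompt=COMBINE_PROMPT,
    verbose=True,
)
chain = RetrievalQAWithSourcesChain(
    combine_documents_chain=qa_chain,
    retriever=db_instructEmbedd.as_retriever(),
)
```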
I am not convinced this is an input token limit at all. I am using PaLM, which has an 8k context window, more than ample for my text chunks, but I get the same warning.
Did you try changing the chain_type to stuff? It's the simplest one. By contrast, map_reduce first collects a result for each document and then combines them and sends that to the LLM. Take a look at the docs: https://python.langchain.com/docs/modules/chains/document/map_reduce
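For example (a minimal sketch reusing the variables from your earlier snippet):

```python
# Minimal sketch of the "stuff" chain: all documents are concatenated into a
# single prompt, so it only works if the combined text fits the model's window.
from langchain.chains.question_answering import load_qa_chain

stuff_chain = load_qa_chain(llm=chatglm, chain_type="stuff")
stuff_chain({"input_documents": search_docs_Documents, "question": query}, return_only_outputs=True)
```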
The warnings come from the tokenizer base class in the transformers library: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3602
def _eventual_warn_about_too_long_sequence(self, ids: List[int], max_length: Optional[int], verbose: bool):
    """
    Depending on the input and internal state we might trigger a warning about a sequence that is too long for its
    corresponding model

    Args:
        ids (`List[str]`): The ids produced by the tokenization
        max_length (`int`, *optional*): The max_length desired (does not trigger a warning if it is set)
        verbose (`bool`): Whether or not to print more information and warnings.
    """
    if max_length is None and len(ids) > self.model_max_length and verbose:
        if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
            logger.warning(
                "Token indices sequence length is longer than the specified maximum sequence length "
                f"for this model ({len(ids)} > {self.model_max_length}). Running this sequence through the model "
                "will result in indexing errors"
            )
        self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True
It is just a default value from transformers, so I guess it can be safely ignored for now.
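If you want to see, or raise, the limit that warning compares against, it is the tokenizer's model_max_length attribute (a sketch for a locally loaded tokenizer; 512 is just the typical default for the T5 tokenizers):

```python
# Where the warning's limit comes from: the tokenizer's model_max_length.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
print(tokenizer.model_max_length)  # typically 512 for T5 tokenizers

# If you know the model actually handles longer inputs, you can raise this to
# silence the warning (it does not change what the model itself supports).
tokenizer.model_max_length = 2048
```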
Hi, @oranda. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue is related to running load_qa_chain with map_reduce resulting in a "Token indices sequence length" error. You mentioned that the default combine prompt for the QA chain is too long for the flan_t5_xxl LLM on HuggingFace. As a workaround, you suggested passing a custom combine_prompt to load_qa_chain. There have been discussions among other users about possible solutions, such as modifying the prompt templates or using a different chain type.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!