
map_reduce._split_list_of_docs has bugs

Open · shangguansb opened this issue 1 year ago

System Info

def _split_list_of_docs(
    docs: List[Document], length_func: Callable, token_max: int, **kwargs: Any
) -> List[List[Document]]:
    new_result_doc_list = []
    _sub_result_docs = []
    for doc in docs:
        _sub_result_docs.append(doc)
        _num_tokens = length_func(_sub_result_docs, **kwargs)
        if _num_tokens > token_max:
            if len(_sub_result_docs) == 1:
                raise ValueError(
                    "A single document was longer than the context length,"
                    " we cannot handle this."
                )
            if len(_sub_result_docs) == 2:
                raise ValueError(
                    "A single document was so long it could not be combined "
                    "with another document, we cannot handle this."
                )
            new_result_doc_list.append(_sub_result_docs[:-1])
            _sub_result_docs = _sub_result_docs[-1:]
    new_result_doc_list.append(_sub_result_docs)
    return new_result_doc_list

I encountered an issue with the following error message: "A single document was so long it could not be combined with another document, we cannot handle this." I suspect this is a bug: the error is raised when the combined length of two documents' summaries exceeds the token_max limit. In that case, I believe the two documents should be summarized separately and then merged. Could you provide a callback so that users can supply the splitting logic of _split_list_of_docs themselves? A rough sketch of what I mean is below.
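For illustration, a minimal sketch of that kind of splitting logic (the function name split_docs_allowing_singletons is hypothetical, not an existing LangChain API): documents are grouped greedily under token_max, and a document that cannot fit alongside its neighbours starts its own group instead of raising.

from typing import Any, Callable, List

from langchain.schema import Document


def split_docs_allowing_singletons(
    docs: List[Document],
    length_func: Callable,
    token_max: int,
    **kwargs: Any,
) -> List[List[Document]]:
    # Hypothetical variant of _split_list_of_docs: instead of raising when two
    # documents together exceed token_max, flush the current group and let the
    # offending document start a group of its own (to be summarized separately).
    groups: List[List[Document]] = []
    current: List[Document] = []
    for doc in docs:
        if current and length_func(current + [doc], **kwargs) > token_max:
            groups.append(current)
            current = [doc]
        else:
            current.append(doc)
    if current:
        groups.append(current)
    return groups

As far as I can tell, LangChain does not expose a hook for this today, so using something like it would mean monkey-patching the module-level function or subclassing MapReduceDocumentsChain.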

Who can help?

No response

Information

  • [ ] The official example notebooks/scripts
  • [ ] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [ ] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [x] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

https://github.com/hwchase17/langchain/blob/01531cb16d09b9290fc091434b0c69cb91a8f500/langchain/chains/combine_documents/map_reduce.py#L22

Expected behavior

I believe the two documents should be summarized separately and then merged. Could you provide a callback so that users can handle the logic of _split_list_of_docs themselves?

shangguansb avatar May 13 '23 04:05 shangguansb

+1

hou-rong avatar May 15 '23 12:05 hou-rong

I'm hitting the same issue:

A single document was so long it could not be combined with another document, we cannot handle this.

mario1in avatar May 29 '23 15:05 mario1in

Did you print your prompt to the console? When I printed the prompt used by map_reduce's document-splitting step, I found that the content was the document's summary while the source was still the document's original text. Comparing it with the rest of the prompt template, I realized I was using it incorrectly: source should only be used for locating the document in the output, so I changed it to the document's id in my database. The message content became much smaller and the problem was resolved.

Also, to limit the length of each piece of content, I added "WITHIN 300 CHARACTERS" at the end of the prompt.
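Roughly, the metadata change described above looks like this (the records variable and the id field are assumptions about the poster's own pipeline, not part of LangChain):

from langchain.schema import Document

# Before: metadata["source"] carried the full original text, inflating every prompt.
# After: keep only a short database id there, so the map/combine prompts stay small.
docs = [
    Document(
        page_content=summary_text,            # the summary actually fed to the chain
        metadata={"source": str(record_id)},  # id used only for locating the original
    )
    for summary_text, record_id in records    # `records` is a hypothetical (summary, id) list
]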

weihaopeng avatar Jun 08 '23 06:06 weihaopeng

@weihaopeng I ran into the same issue. In my case I use a splitter whose length_function is tiktoken, the same tokenizer as the model running the chain. Each document from the long text is chunked to just under 2500 tokens, leaving (with gpt-3.5-turbo) at least 1000 tokens for generation, plus a few hundred tokens give or take for the map prompt instructions.

  • Each input to the first layer of map prompts is under the token limit.
  • My map prompt condenses the input, so the generated text is always smaller than the input.
  • MapReduce is implemented so that the combine chain is run only once the combined output of all the map steps, formatted with the combine prompt, is under 3000 tokens.
  • Yet I receive this error, indicating that somewhere along the line the combine step calculated a map step's output to be too large.

This shouldn't happen; if anyone else has any insight, I'd be glad to help debug and fix this. For what it's worth, I see this more often on Japanese texts, although it should be reproducible for any language/tokenization. A quick way to narrow down where the overflow happens is sketched below.
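The idea is to count tokens of the intermediate map outputs with the same tokenizer the chain uses; a rough sketch, assuming tiktoken and gpt-3.5-turbo (map_outputs is a placeholder for the strings returned by the map step, e.g. collected with return_intermediate_steps=True on the chain):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# `map_outputs` is a placeholder for the list of strings produced by the map step.
for i, text in enumerate(map_outputs):
    n_tokens = len(enc.encode(text))
    print(f"map output {i}: {n_tokens} tokens")
    if n_tokens > 3000:
        print("  -> this single output already exceeds the combine limit")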

ShantanuNair avatar Jun 20 '23 16:06 ShantanuNair

Related: https://github.com/hwchase17/langchain/issues/6191 and https://github.com/hwchase17/langchain/issues/6397 Potentially related: https://github.com/hwchase17/langchain/issues/5829

ShantanuNair avatar Jun 20 '23 17:06 ShantanuNair

@shangguansb @hou-rong I am discussing with the LangChain team about overhauling the map-reduce implementation itself. Until then, if you want to be able to set token_max, here is how you can do that :)

res = await chain(inputs={'input_documents': texts, 'token_max': 12000}, return_only_outputs=True)
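For context, here is roughly how that fits into a full map-reduce summarize chain (the llm, model name, and texts are placeholders; extra input keys such as token_max are forwarded to combine_docs, at least in the versions current at the time of this thread):

from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)  # placeholder model choice
chain = load_summarize_chain(llm, chain_type="map_reduce")

# token_max is passed through as an extra input key and raises the combine-step limit.
res = await chain.acall(
    {"input_documents": texts, "token_max": 12000},
    return_only_outputs=True,
)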

ShantanuNair avatar Jul 05 '23 07:07 ShantanuNair

Hi, @shangguansb! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is related to the _split_list_of_docs function in LangChain. This function throws an error when the combined length of two documents exceeds the token_max limit. Some users have shared their experiences and workarounds for this issue, and ShantanuNair is discussing with the LangChain team about overhauling the mapreduce implementation. In the meantime, users can set token_max by using a specific code snippet.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

dosubot[bot] avatar Oct 05 '23 16:10 dosubot[bot]