map_reduce._split_list_of_docs has bugs
System Info
```python
# Excerpt from langchain/chains/combine_documents/map_reduce.py
from typing import Any, Callable, List

from langchain.docstore.document import Document


def _split_list_of_docs(
    docs: List[Document], length_func: Callable, token_max: int, **kwargs: Any
) -> List[List[Document]]:
    new_result_doc_list = []
    _sub_result_docs = []
    for doc in docs:
        _sub_result_docs.append(doc)
        _num_tokens = length_func(_sub_result_docs, **kwargs)
        if _num_tokens > token_max:
            if len(_sub_result_docs) == 1:
                raise ValueError(
                    "A single document was longer than the context length,"
                    " we cannot handle this."
                )
            # This is the branch that raises the error reported below: two
            # docs whose combined length exceeds token_max are rejected
            # outright instead of being placed in separate groups.
            if len(_sub_result_docs) == 2:
                raise ValueError(
                    "A single document was so long it could not be combined "
                    "with another document, we cannot handle this."
                )
            new_result_doc_list.append(_sub_result_docs[:-1])
            _sub_result_docs = _sub_result_docs[-1:]
    new_result_doc_list.append(_sub_result_docs)
    return new_result_doc_list
```
I encountered the following error message: "A single document was so long it could not be combined with another document, we cannot handle this." I suspect this is a bug. The error occurs when the combined length of the summaries of two docs exceeds the `token_max` limit. In that case, I believe the two docs should be summarized separately and then merged. Could you provide a callback function allowing users to handle the logic of the `_split_list_of_docs` function themselves?
Who can help?
No response
Information
- [ ] The official example notebooks/scripts
- [ ] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [ ] Document Loaders
- [ ] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [x] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
https://github.com/hwchase17/langchain/blob/01531cb16d09b9290fc091434b0c69cb91a8f500/langchain/chains/combine_documents/map_reduce.py#L22
Expected behavior
I believe the two docs should be summarized separately and then merged, instead of raising an error. Could you provide a callback function allowing users to handle the logic of `_split_list_of_docs` themselves?
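For illustration, here is a rough sketch of the grouping behavior I would expect. This is not LangChain's actual code; the name `_split_list_of_docs_lenient` and the re-check of the carried-over doc are my own assumptions:

```python
from typing import Any, Callable, List

from langchain.docstore.document import Document


def _split_list_of_docs_lenient(
    docs: List[Document], length_func: Callable, token_max: int, **kwargs: Any
) -> List[List[Document]]:
    """Like _split_list_of_docs, but two docs that do not fit together are
    placed in separate groups instead of raising a ValueError."""
    new_result_doc_list: List[List[Document]] = []
    _sub_result_docs: List[Document] = []
    for doc in docs:
        _sub_result_docs.append(doc)
        if length_func(_sub_result_docs, **kwargs) > token_max:
            # Flush everything before the current doc as one group (possibly
            # a group of one) and start a fresh group with the current doc.
            if len(_sub_result_docs) > 1:
                new_result_doc_list.append(_sub_result_docs[:-1])
                _sub_result_docs = _sub_result_docs[-1:]
            # Re-check the remaining doc on its own: a single doc over the
            # limit still has to fail loudly.
            if length_func(_sub_result_docs, **kwargs) > token_max:
                raise ValueError(
                    "A single document was longer than the context length,"
                    " we cannot handle this."
                )
    if _sub_result_docs:
        new_result_doc_list.append(_sub_result_docs)
    return new_result_doc_list
```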
+1
I'm hitting the same issue:
> A single document was so long it could not be combined with another document, we cannot handle this.
Did you print your prompt to the console? When I printed the prompt used in map_reduce's `_split_list_of_docs` step, I found that the `content` was the document's summary and the `source` was the original text of the document. Comparing this with other parts of the prompt template, I realized I was using it incorrectly. I changed `source` to the document's id in the database, because I believe `source` should only be used for locating purposes in the output of this prompt. After that, the content of the message itself was much shorter, and the problem was resolved. Also, to limit the length of each `content`, I added `WITHIN 300 CHARACTERS` at the end of the prompt.
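To make that workaround concrete, here is a hedged sketch of what it might look like. The variable names (`chunks`, `row_id`) and the exact prompt wording are my own illustration, not code from this thread:

```python
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate

# Hypothetical chunks of the original text, e.g. from a text splitter.
chunks = ["first chunk of the original text...", "second chunk..."]

# Store a short database id in `source` instead of the original text, so the
# metadata does not inflate the prompt fed to the map/reduce steps.
docs = [
    Document(page_content=chunk, metadata={"source": str(row_id)})
    for row_id, chunk in enumerate(chunks)
]

# Ask the map step to cap each summary's length, per the workaround above.
map_prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Write a concise summary of the following text.\n\n"
        "{text}\n\n"
        "CONCISE SUMMARY, WITHIN 300 CHARACTERS:"
    ),
)
```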
@weihaopeng I ran into the same issue. In my case I use a splitter with a tiktoken-based `length_function`, matching the model running the chain. Each document from the long text is chunked to just under 2,500 tokens, leaving (for gpt-3.5-turbo) at least 1,000 tokens for generation, give or take a few hundred tokens for the map prompt instructions.
- Each input to the first layer of map prompts is under the token limit.
- My map prompt condenses the input, so the generated text is always smaller than the input.
- MapReduce is implemented such that the combine chain runs only once all the map steps' outputs, formatted using the combine prompt, total under 3,000 tokens.
- Yet I receive this error, indicating that somewhere along the line the combine step calculated a map step's output to be too large.

This shouldn't happen; if anyone has any insight, I'd be glad to help debug and fix it. For what it's worth, I see this more often on Japanese texts, although it should be reproducible for any language/tokenization. A sketch of my setup is below.
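For reference, a minimal sketch of the setup described above; the exact splitter class and chunk size are assumptions on my part:

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def tiktoken_len(text: str) -> int:
    # Measure length in model tokens rather than characters, so CJK text
    # (where one character often maps to several tokens) is counted the same
    # way the chain will count it.
    return len(enc.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500,       # just under the limit, leaving headroom for generation
    chunk_overlap=0,
    length_function=tiktoken_len,
)
```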
Related: https://github.com/hwchase17/langchain/issues/6191 and https://github.com/hwchase17/langchain/issues/6397 Potentially related: https://github.com/hwchase17/langchain/issues/5829
@shangguansb @hou-rong I am discussing with the langchain team about overhauling the mapreduce implementation itself. If you want to be able to set `token_max` until then, here is how you can do that :)

```python
res = await chain(inputs={'input_documents': texts, 'token_max': 12000}, return_only_outputs=True)
```
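For context, the call above assumes `chain` is a map_reduce chain and `texts` is a list of `Document`s. A fuller version might look like the sketch below; the `load_summarize_chain` setup and model choice are my assumptions, not from this thread, and I use `acall` for the async invocation:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(model_name="gpt-3.5-turbo")  # assumed model choice
chain = load_summarize_chain(llm, chain_type="map_reduce")

# `texts` is the list of Documents to combine; `token_max` is forwarded to
# the combine step's grouping logic.
res = await chain.acall(
    inputs={"input_documents": texts, "token_max": 12000},
    return_only_outputs=True,
)
```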
Hi, @shangguansb! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you reported is related to the `_split_list_of_docs` function in LangChain. This function throws an error when the combined length of two documents exceeds the `token_max` limit. Some users have shared their experiences and workarounds for this issue, and ShantanuNair is discussing with the LangChain team about overhauling the mapreduce implementation. In the meantime, users can set `token_max` using the code snippet above.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project!