AttributeError: 'tuple' object has no attribute 'page_content' when running a `load_summarize_chain` on a Document generated from PyPDFLoader
Code:
from langchain.document_loaders import PyPDFLoader
from langchain.chains.summarize import load_summarize_chain

loader_book = PyPDFLoader("D:/PaperPal/langchain-tutorials/data/The Attention Merchants_ The Epic Scramble to Get Inside Our Heads ( PDFDrive ) (1).pdf")
test = loader_book.load()
chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)
chain.run(test[0])
I get the following error even though test[0] is a Document object:
> Entering new MapReduceDocumentsChain chain...
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
d:\PaperPal\langchain-tutorials\chains\Chain Types.ipynb Cell 19 in ()
----> 1 chain.run(test[0])
File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:213, in Chain.run(self, *args, **kwargs)
211 if len(args) != 1:
212 raise ValueError("`run` supports only one positional argument.")
--> 213 return self(args[0])[self.output_keys[0]]
215 if kwargs and not args:
216 return self(kwargs)[self.output_keys[0]]
File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:116, in Chain.__call__(self, inputs, return_only_outputs)
114 except (KeyboardInterrupt, Exception) as e:
115 self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 116 raise e
117 self.callback_manager.on_chain_end(outputs, verbose=self.verbose)
118 return self.prep_outputs(inputs, outputs, return_only_outputs)
File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:113, in Chain.__call__(self, inputs, return_only_outputs)
107 self.callback_manager.on_chain_start(
108 {"name": self.__class__.__name__},
109 inputs,
110 verbose=self.verbose,
111 )
...
--> 141 [{**{self.document_variable_name: d.page_content}, **kwargs} for d in docs]
142 )
143 return self._process_results(results, docs, token_max, **kwargs)
AttributeError: 'tuple' object has no attribute 'page_content'
Had the same issue. Try chain.run([test[0]])
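Spelled out, that workaround looks like this (a minimal sketch reusing llm and test from the snippet above; the map_reduce chain appears to expect a list of Documents rather than a single Document):
chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)
summary = chain.run([test[0]])  # wrap the single page in a list
# summary = chain.run(test)     # or pass every loaded page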
A workaround that works for me for now is setting doc = doc[0] at the top of format_document:
def format_document(doc: Document, prompt: BasePromptTemplate) -> str:
    """Format a document into a string based on a prompt template."""
    doc = doc[0]
    base_info = {"page_content": doc.page_content}
    base_info.update(doc.metadata)
    missing_metadata = set(prompt.input_variables).difference(base_info)
    if len(missing_metadata) > 0:
        required_metadata = [
            iv for iv in prompt.input_variables if iv != "page_content"
        ]
        raise ValueError(
            f"Document prompt requires documents to have metadata variables: "
            f"{required_metadata}. Received document with missing metadata: "
            f"{list(missing_metadata)}."
        )
    document_info = {k: base_info[k] for k in prompt.input_variables}
    return prompt.format(**document_info)
Refer to the following link: https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html
Same problem as OP.
More on the problem. It appears FAISS returns docs with an extra layer of abstraction for similarity_search_with_score() and similarity_search_with_relevance_scores(): each result is a (Document, score) tuple. It does not have that extra layer for similarity_search(), which returns plain Documents.
My code was:
from langchain.vectorstores import FAISS
knowledge_base = FAISS.from_texts(chunks, embeddings)
docs = knowledge_base.similarity_search_with_score(user_question, k=4)
response = chain.run({"input_documents" : docs, "question" : user_question})
and the error is:
File "D:_code\mycode\Langchain_PDF_Querybot\venv\Lib\site-packages\langchain\chains\combine_documents\refine.py", line 133, in _construct_initial_inputs
    base_info = {"page_content": docs[0].page_content}
                                 ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'page_content'
This also fails with the same error: docs = knowledge_base._similarity_search_with_relevance_scores(user_question, k=4)
Changing line 133 of refine.py (in a debugger) to base_info = {"page_content": docs[0][0].page_content} allows page_content to be found, but I have no idea how badly that would mess up other things. Notably, if that change is made, the error reappears on the next line (134) and again later during execution, so that is not a fix. I think the better fix is to have a consistent data representation for what is returned from FAISS.
What does work is: docs = knowledge_base.similarity_search(user_question, k=4) but now there is no relevance score.
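If you need the scores, one workaround (a sketch, assuming the same knowledge_base, chain and user_question as above) is to unpack the (Document, score) tuples yourself before handing the documents to the chain:
docs_and_scores = knowledge_base.similarity_search_with_score(user_question, k=4)
docs = [doc for doc, _score in docs_and_scores]      # plain Documents for the chain
scores = [score for _doc, score in docs_and_scores]  # keep the relevance scores separately
response = chain.run({"input_documents": docs, "question": user_question})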
Langchain is supposed to provide an abstraction layer that allows one to swap in different llms, different vector dbs, etc., but inconsistencies in the intermediate data representation like this make things harder, not easier.
So as a frustrated user, let me put in a strong argument for consistent representations of data, or if deviations have to occur, making it super obvious in the documentation. (This was not clear enough: https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html?highlight=faiss and neither was this: https://python.langchain.com/en/latest/use_cases/summarization.html).
The documentation does a good job of walking through the types of modules (both conceptually and by function), but a poor job of discussing the data structures at each layer that are used to interconnect the modules. A good starting point would be giving each data structure type a name, a spec on the structural variations allowed and a list of what it interconnects. Most of the code in the documentation is snippets, making it hard to clearly see the bigger picture from these. With respect to the longer use cases provided, there are lots of web complaints about interchanging LLMs or DBs not working, which is why I think the problem is more fundamental than just the specific bug/issue discussed above.
If the documentation had a brighter spotlight on the data structures used between modules, this might help prevent a lot of these issues.
This also happens when using DeepLake.
Hi, @Vishruth-N! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you reported is related to an AttributeError that occurs when running a load_summarize_chain on a Document generated from PyPDFLoader. There have been some suggestions provided by other users as potential workarounds. One user suggested using chain.run([test[0]]), while another user shared a code snippet that sets doc = doc[0] as a temporary fix.
Before we close this issue, we would like to confirm if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!
Same problem here.
Yes, I'm still getting this issue; why is it closed? Per the documentation:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "My secret title is 'King Boomy'."
data = Document(page_content=text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
Yields the following error:
Traceback (most recent call last):
  File "/home/<user>/rag/pipeline.py", line 11, in <module>
    all_splits = text_splitter.split_documents(data)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/<user>/.conda/envs/rag/lib/python3.11/site-packages/langchain_text_splitters/base.py", line 94, in split_documents
    texts.append(doc.page_content)
                 ^^^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'page_content'
Passing in 'data[0]' yields:
Traceback (most recent call last):
  File "/home/<user>/rag/pipeline.py", line 11, in <module>
    all_splits = text_splitter.split_documents(data[0])
                                               ~~~~^^^
TypeError: 'Document' object is not subscriptable
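A workaround consistent with the earlier suggestions is to pass split_documents a list of Documents instead of a single Document (a minimal sketch, keeping the same text as above):
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "My secret title is 'King Boomy'."
data = [Document(page_content=text)]  # split_documents expects an iterable of Documents

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)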