
AttributeError: 'tuple' object has no attribute 'page_content' when running a `load_summarize_chain` on a Document generated by PyPDFLoader

Open Vishruth-N opened this issue 1 year ago • 3 comments

Code:

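from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader

# llm is assumed to be an already-configured LLM instance
loader_book = PyPDFLoader("D:/PaperPal/langchain-tutorials/data/The Attention Merchants_ The Epic Scramble to Get Inside Our Heads ( PDFDrive ) (1).pdf")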
test = loader_book.load()
chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)
chain.run(test[0])

I get the following error even though test[0] is a Document object:

> Entering new MapReduceDocumentsChain chain...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
d:\PaperPal\langchain-tutorials\chains\Chain Types.ipynb Cell 19 in ()
----> 1 chain.run(test[0])

File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:213, in Chain.run(self, *args, **kwargs)
    211     if len(args) != 1:
    212         raise ValueError("`run` supports only one positional argument.")
--> 213     return self(args[0])[self.output_keys[0]]
    215 if kwargs and not args:
    216     return self(kwargs)[self.output_keys[0]]

File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:116, in Chain.__call__(self, inputs, return_only_outputs)
    114 except (KeyboardInterrupt, Exception) as e:
    115     self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 116     raise e
    117 self.callback_manager.on_chain_end(outputs, verbose=self.verbose)
    118 return self.prep_outputs(inputs, outputs, return_only_outputs)

File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:113, in Chain.__call__(self, inputs, return_only_outputs)
    107 self.callback_manager.on_chain_start(
    108     {"name": self.__class__.__name__},
    109     inputs,
    110     verbose=self.verbose,
    111 )
...
--> 141         [{**{self.document_variable_name: d.page_content}, **kwargs} for d in docs]
    142     )
    143     return self._process_results(results, docs, token_max, **kwargs)

AttributeError: 'tuple' object has no attribute 'page_content'

Vishruth-N avatar Apr 12 '23 00:04 Vishruth-N

Had the same issue. Try chain.run([test[0]]).
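
For anyone wondering why the message mentions a tuple: as far as I can tell, the chain loops over whatever it receives as the document list, and Document is a pydantic model, so looping over a single Document yields (field_name, value) pairs instead of documents. The sketch below reproduces that with a stand-in class. FakeDocument and collect_page_contents are hypothetical names for illustration only, not LangChain code.

```python
from dataclasses import dataclass, field, fields


@dataclass
class FakeDocument:
    """Stand-in for LangChain's Document (hypothetical, illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

    def __iter__(self):
        # Mimics pydantic's BaseModel.__iter__, which yields (field_name, value) pairs.
        for f in fields(self):
            yield f.name, getattr(self, f.name)


def collect_page_contents(docs):
    """Mirrors the comprehension in the traceback: one page_content per item."""
    return [d.page_content for d in docs]


doc = FakeDocument(page_content="first page text")

# Wrapping the single document in a list works as expected.
contents = collect_page_contents([doc])

# Passing the bare document iterates over its fields, so each item is a tuple.
try:
    collect_page_contents(doc)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(contents)
print(error_message)
```

That is why chain.run([test[0]]) works while chain.run(test[0]) does not.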

olivier-lacroix avatar Apr 12 '23 12:04 olivier-lacroix

A workaround that works for me for now is setting doc = doc[0] at the top of format_document:

def format_document(doc: Document, prompt: BasePromptTemplate) -> str:
  """Format a document into a string based on a prompt template."""

  doc = doc[0]

  base_info = {"page_content": doc.page_content}
  base_info.update(doc.metadata)
  missing_metadata = set(prompt.input_variables).difference(base_info)
  if len(missing_metadata) > 0:
      required_metadata = [
          iv for iv in prompt.input_variables if iv != "page_content"
      ]
      raise ValueError(
          f"Document prompt requires documents to have metadata variables: "
          f"{required_metadata}. Received document with missing metadata: "
          f"{list(missing_metadata)}."
      )
  document_info = {k: base_info[k] for k in prompt.input_variables}
  return prompt.format(**document_info)

OlajideOgun avatar May 06 '23 22:05 OlajideOgun

Refer to the following link: https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html

codemaker2015 avatar May 22 '23 22:05 codemaker2015

Same problem as OP.

codeisnotcode avatar May 29 '23 08:05 codeisnotcode

More on the problem. It appears FAISS returns docs with an extra layer of wrapping (each result is a (Document, score) tuple) for similarity_search_with_score() and similarity_search_with_relevance_scores(). But it does not have that extra layer for similarity_search().

My code was:

from langchain.vectorstores import FAISS

knowledge_base = FAISS.from_texts(chunks, embeddings)

docs = knowledge_base.similarity_search_with_score(user_question, k=4)

response = chain.run({"input_documents" : docs, "question" : user_question})

and the error is:

  File "D:_code\mycode\Langchain_PDF_Querybot\venv\Lib\site-packages\langchain\chains\combine_documents\refine.py", line 133, in _construct_initial_inputs
    base_info = {"page_content": docs[0].page_content}
                                 ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'page_content'

This also fails with the same error: docs = knowledge_base._similarity_search_with_relevance_scores(user_question, k=4)

Changing line 133 of refine.py (in a debugger) to: base_info = {"page_content": docs[0][0].page_content} allows page_content to be found, but I have no idea how badly that would mess up other things. Notably, if that change is made, the error reappears on the next line (134) and again later during execution, so that is not a fix. I think the better fix is for FAISS to return a consistent data representation.

What does work is: docs = knowledge_base.similarity_search(user_question, k=4) but now there is no relevance score.
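
One way to keep the scores, sketched here under assumptions: since each scored result is a (Document, score) pair, the pairs can be split before the documents are handed to the chain. FakeDocument and the sample texts and scores below are made up for illustration; with the real store, scored_docs would come from knowledge_base.similarity_search_with_score(user_question, k=4).

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for LangChain's Document (hypothetical, illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Made-up stand-in for the (Document, score) pairs a scored search returns.
scored_docs = [
    (FakeDocument("chunk about pricing"), 0.31),
    (FakeDocument("chunk about refunds"), 0.47),
]

# Split each pair: plain documents for the chain, scores kept alongside.
docs = [doc for doc, _score in scored_docs]
scores = [score for _doc, score in scored_docs]

print([d.page_content for d in docs])
print(scores)
```

The docs list then has the shape the chain expects, e.g. chain.run({"input_documents": docs, "question": user_question}), and scores stays available in the same order.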

LangChain is supposed to provide an abstraction layer that allows one to swap in different LLMs, different vector DBs, etc., but inconsistencies in the intermediate data representation like this make things harder, not easier.

So as a frustrated user, let me put in a strong argument for consistent representations of data, or if deviations have to occur, making it super obvious in the documentation. (This was not clear enough: https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html?highlight=faiss and neither was this: https://python.langchain.com/en/latest/use_cases/summarization.html).

The documentation does a good job of walking through the types of modules (both conceptually and by function), but a poor job of discussing the data structures at each layer that are used to interconnect the modules. A good starting point would be giving each data structure type a name, a spec on the structural variations allowed and a list of what it interconnects. Most of the code in the documentation is snippets, making it hard to clearly see the bigger picture from these. With respect to the longer use cases provided, there are lots of web complaints about interchanging LLMs or DBs not working, which is why I think the problem is more fundamental than just the specific bug/issue discussed above.

If the documentation had a brighter spotlight on the data structures used between modules, this might help prevent a lot of these issues.

codeisnotcode avatar May 29 '23 22:05 codeisnotcode

This also happens when using DeepLake

OlajideOgun avatar May 29 '23 23:05 OlajideOgun

Hi, @Vishruth-N! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is related to an AttributeError that occurs when running a load_summarize_chain on a Document generated from PyPDF Loader. There have been some suggestions provided by other users as potential workarounds. One user suggested using chain.run([test[0]]), while another user shared a code snippet that sets doc = doc[0] as a temporary fix.

Before we close this issue, we would like to confirm if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Sep 20 '23 16:09 dosubot[bot]

Same problem here. [image]

ahmedDaoudi-u avatar Apr 04 '24 10:04 ahmedDaoudi-u

Yes, I'm still getting this issue. Why is it closed? Per the documentation:

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
text = "My secret title is 'King Boomy'."
data = Document(page_content=text)
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)

Yields the following error:

Traceback (most recent call last):
  File "/home/<user>/rag/pipeline.py", line 11, in <module>
    all_splits = text_splitter.split_documents(data)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/<user>/.conda/envs/rag/lib/python3.11/site-packages/langchain_text_splitters/base.py", line 94, in split_documents
    texts.append(doc.page_content)
                 ^^^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'page_content'

Passing in 'data[0]' yields:

Traceback (most recent call last):
  File "/home/<user>/rag/pipeline.py", line 11, in <module>
    all_splits = text_splitter.split_documents(data[0])
                                               ~~~~^^^
TypeError: 'Document' object is not subscriptable
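
Both tracebacks are consistent with split_documents expecting a list (or other iterable) of Document objects rather than a single one, so text_splitter.split_documents([data]) is the likely fix. A small guard like the sketch below can normalize either shape before the call. as_document_list is a hypothetical helper, not part of LangChain, and SimpleNamespace is only a stand-in for a real Document.

```python
from types import SimpleNamespace


def as_document_list(value):
    """Return a list of document-like objects (hypothetical helper, not LangChain API)."""
    # A single document-like object has page_content itself; wrap it in a list.
    if hasattr(value, "page_content"):
        return [value]
    # Otherwise assume it is already an iterable of documents.
    return list(value)


# Stand-in for Document(page_content=text).
data = SimpleNamespace(page_content="My secret title is 'King Boomy'.", metadata={})

from_single = as_document_list(data)
from_list = as_document_list([data])

print(len(from_single), len(from_list))
```

With the real splitter this would be used as text_splitter.split_documents(as_document_list(data)).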

DrewGalbraith avatar Jun 13 '24 20:06 DrewGalbraith