
OnlinePDFLoader crashes with import error on Google Colab

Open ishan-siddiqui opened this issue 4 months ago • 1 comments

Checked other resources

  • [X] I added a very descriptive title to this issue.
  • [X] I searched the LangChain documentation with the integrated search.
  • [X] I used the GitHub search to find a similar question and didn't find it.
  • [X] I am sure that this is a bug in LangChain rather than my code.
  • [X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Steps to Replicate:

Requirements.txt

%%writefile requirements.txt
replicate
langchain
langchain-community
sentence-transformers
pdf2image
pdfminer
pdfminer.six
unstructured
faiss-gpu
uvicorn
ctransformers
python-box
streamlit

Installing on colab

!pip install -r requirements.txt

Code I am trying to run

# Load the external data source
from langchain.document_loaders import OnlinePDFLoader
loader = OnlinePDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/")
documents = loader.load()

Error Message and Stack Trace (if applicable)

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-90-759c82deb3bb> in <cell line: 4>()
      2 from langchain_community.document_loaders import OnlinePDFLoader
      3 loader = OnlinePDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/")
----> 4 documents = loader.load()
      5 
      6 # Step 2: Get text splits from Document

4 frames
/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py in load(self)
    157         """Load documents."""
    158         loader = UnstructuredPDFLoader(str(self.file_path))
--> 159         return loader.load()
    160 
    161 

/usr/local/lib/python3.10/dist-packages/langchain_core/document_loaders/base.py in load(self)
     27     def load(self) -> List[Document]:
     28         """Load data into Document objects."""
---> 29         return list(self.lazy_load())
     30 
     31     async def aload(self) -> List[Document]:

/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/unstructured.py in lazy_load(self)
     86     def lazy_load(self) -> Iterator[Document]:
     87         """Load file."""
---> 88         elements = self._get_elements()
     89         self._post_process_elements(elements)
     90         if self.mode == "elements":

/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py in _get_elements(self)
     69 
     70     def _get_elements(self) -> List:
---> 71         from unstructured.partition.pdf import partition_pdf
     72 
     73         return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in <module>
     36 from pdfminer.utils import open_filename
     37 from PIL import Image as PILImage
---> 38 from pillow_heif import register_heif_opener
     39 
     40 from unstructured.chunking import add_chunking_strategy

ModuleNotFoundError: No module named 'pillow_heif'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

Description

  • I am trying to use LangChain in a Google Colab notebook to load a PDF.
  • Expected behavior: the PDF loads.
  • Actual behavior: loader.load() raises ModuleNotFoundError: No module named 'pillow_heif'.

System Info

Langchain Version on Google Colab

langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.45
langchain-text-splitters==0.0.1

Langchain Community Version on Google Colab

langchain-community==0.0.34

ishan-siddiqui, Apr 20 '24 19:04

I'm trying to follow Meta Developer's Llama 2 tutorial. Link for reference: https://youtu.be/Z5MFSlDrOdA?t=1539

ishan-siddiqui, Apr 20 '24 19:04

Hi @ishan-siddiqui, you will need to install the unstructured package with its all-docs extras before the import:

pip install unstructured[all-docs]

Source: unstructured_file.ipynb

salikadave, Apr 22 '24 04:04
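
For anyone hitting the same traceback, here is a rough preflight check that could be run in a notebook cell before calling loader.load(). It is only a sketch: the module list is taken from the three imports visible in the traceback above (pdfminer, PIL, pillow_heif) and is not the full dependency list of unstructured, and the helper name missing_modules is made up for this example.

```python
import importlib.util

# Modules imported near the failing line of unstructured/partition/pdf.py,
# as shown in the traceback. This is NOT the full dependency list of
# unstructured, only what the traceback reveals.
REQUIRED_MODULES = ["pdfminer", "PIL", "pillow_heif"]


def missing_modules(names):
    """Return the top-level module names that cannot be found.

    Hypothetical helper for this thread; find_spec() returns None when a
    top-level module is not installed in the current environment.
    """
    missing = []
    for name in names:
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing:", ", ".join(missing))
        # Suggested fix from the comment above; quoting the argument keeps
        # shells that expand square brackets from mangling it.
        print('Try: pip install "unstructured[all-docs]"')
    else:
        print("All PDF-related imports seen in the traceback are available.")
```

If pillow_heif shows up as missing, installing the extras as suggested above and restarting the Colab runtime before re-running the import may be needed, since an already-running kernel does not always pick up freshly installed packages.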