Compatibility issue between unstructured[pdf]==0.16.11 and pdfminer.six 20250327

Open Badribn0612 opened this issue 8 months ago • 3 comments

Description: When installing unstructured[pdf]==0.16.11, it pulls in the latest pdfminer.six 20250327, which causes import errors.

Steps to reproduce:

  1. pip install "unstructured[pdf]==0.16.11"
  2. Run the following code:
from langchain_community.document_loaders import UnstructuredPDFLoader

doc = './IF10244.pdf'
loader = UnstructuredPDFLoader(file_path=doc,
                               strategy='hi_res',
                               extract_images_in_pdf=True,
                               infer_table_structure=True,
                               mode='elements',
                               image_output_dir_path='./figures')
data = loader.load()

Error:
ImportError: cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser' (/usr/local/lib/python3.11/dist-packages/pdfminer/pdfparser.py)

The issue appears to be that unstructured is trying to import PSSyntaxError from pdfminer.pdfparser, but this class isn't available there in the newest version of pdfminer.six.

Workaround: Downgrading pdfminer.six resolves the issue:
pip install pdfminer.six==20240706

Environment:
Python version: 3.11
OS: Linux
unstructured: 0.16.11
pdfminer.six: 20250327 (fails), 20240706 (works)
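
To check whether a given environment is affected before running the loader, a small diagnostic along these lines may help. It is only a sketch: the helper names (installed_version, can_import_name, diagnose) are made up for illustration, and the suggested pin is simply the version reported as working in this issue, not a verified compatibility boundary.

```python
import importlib
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def can_import_name(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


def diagnose() -> str:
    """Summarize whether the import that fails in this issue would succeed here."""
    version = installed_version("pdfminer.six")
    if version is None:
        return "pdfminer.six is not installed"
    if can_import_name("pdfminer.pdfparser", "PSSyntaxError"):
        return f"pdfminer.six {version}: PSSyntaxError import OK"
    # Assumption: 20240706 is the version reported as working in this thread.
    return (
        f"pdfminer.six {version}: PSSyntaxError missing from pdfminer.pdfparser; "
        "consider pip install pdfminer.six==20240706"
    )


if __name__ == "__main__":
    print(diagnose())
```

Running it in the same interpreter that raises the ImportError should show which pdfminer.six version is actually being picked up, which is useful when several virtual environments are in play.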

Badribn0612 avatar Apr 09 '25 05:04 Badribn0612

Has anybody solved this issue?

buffliu avatar Apr 22 '25 01:04 buffliu

The setup I have listed in https://github.com/robot-stefan/langchain-rag/blob/main/requirements.txt also works on Python 3.10. I just ran into this while migrating and pulling packages for a new venv. Hope this helps others.

robot-stefan avatar May 10 '25 00:05 robot-stefan

pip install pdfminer.six==20240706

Thank you so much. It worked! :)

deepakdhiman7 avatar Jun 24 '25 14:06 deepakdhiman7