Compatibility issue between unstructured[pdf]==0.16.11 and pdfminer.six 20250327
Description: Installing unstructured[pdf]==0.16.11 pulls in the latest pdfminer.six (20250327), which causes import errors.
Steps to reproduce:
- pip install "unstructured[pdf]==0.16.11"
- Run the following code:
from langchain_community.document_loaders import UnstructuredPDFLoader
doc = './IF10244.pdf'
loader = UnstructuredPDFLoader(file_path=doc,
                               strategy='hi_res',
                               extract_images_in_pdf=True,
                               infer_table_structure=True,
                               mode='elements',
                               image_output_dir_path='./figures')
data = loader.load()
Error:
ImportError: cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser' (/usr/local/lib/python3.11/dist-packages/pdfminer/pdfparser.py)
The issue appears to be that unstructured is trying to import PSSyntaxError from pdfminer.pdfparser, but this class isn't available there in the newest version of pdfminer.six.
Workaround: Downgrading pdfminer.six resolves the issue:
pip install pdfminer.six==20240706
Environment:
- Python version: 3.11
- OS: Linux
- unstructured: 0.16.11
- pdfminer.six: 20250327 (fails), 20240706 (works)
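To check quickly whether a given environment is affected, the failing import from the traceback can be tried on its own, without going through langchain or unstructured. The snippet below is only a rough sketch: the helper name check_pdfminer is made up for illustration, and it tests nothing beyond the one import named in the error message.

```python
from importlib import metadata


def check_pdfminer():
    """Return (installed_version, import_ok) for pdfminer.six.

    check_pdfminer is a hypothetical helper, not part of any library.
    """
    try:
        version = metadata.version("pdfminer.six")
    except metadata.PackageNotFoundError:
        # pdfminer.six is not installed in this environment at all.
        return None, False
    try:
        # The same import that fails in the traceback above.
        from pdfminer.pdfparser import PSSyntaxError  # noqa: F401
    except ImportError:
        return version, False
    return version, True


if __name__ == "__main__":
    version, ok = check_pdfminer()
    if version is None:
        print("pdfminer.six is not installed")
    elif ok:
        print(f"pdfminer.six {version}: PSSyntaxError import works")
    else:
        print(f"pdfminer.six {version}: PSSyntaxError import fails, "
              "try: pip install pdfminer.six==20240706")
```

Based on the versions reported here, this should report a failing import with 20250327 and a working one with 20240706; other versions have not been checked in this thread.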
Has anybody solved this issue?
The setup I have listed in https://github.com/robot-stefan/langchain-rag/blob/main/requirements.txt also works on Python 3.10. I just ran into this while migrating and pulling packages for a new venv. Hope this helps others.
pip install pdfminer.six==20240706
Thank you so much. It worked! :)