unstructured Unable to load file

This may be related. When partition_pdf is given an open binary file object instead of a filename, an error is thrown.

Example:

from unstructured.partition.pdf import partition_pdf

with open("./that.pdf", 'rb') as f:
    elements = partition_pdf(
        file=f,
        strategy='hi_res',
        is_image=False,
        include_page_breaks=True,
        analysis=True,
        infer_table_structure=True,
    )

Error:

PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\LavoriV\AppData\Local\Temp\tmp_fqqu798': No error.

May 24 '24 11:05 vlavorini

Hi @vlavorini - could you let us know which versions and operating system you're using, and share an example file we could use to reproduce?
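One way to collect the requested details is sketched below. It is a minimal sketch, not an official diagnostic: the function name environment_report is made up for illustration, and the package list is an assumption about which distributions matter here.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("unstructured", "unstructured-inference", "pdf2image")):
    """Return a dict with the OS, the Python version, and installed package versions."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a reply covers both the operating system and the version question.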

May 24 '24 14:05 MthwRobinson

I'm on Windows, with Unstructured version 0.14.2. Here is the file I use: population in EU.pdf

May 28 '24 11:05 vlavorini

Thanks, @vlavorini !

May 28 '24 12:05 MthwRobinson

I also encountered this issue. Is there any progress or update on this matter?

Jul 03 '24 14:07 andy1213aa

@andy1213aa can you post the code you used and also mention whether you are on Windows?

Jul 03 '24 18:07 scanny

I used the same code provided above by @vlavorini, and yes, I am also running it on Windows. Is there currently a way to solve this problem on Windows? Thank you!

Jul 03 '24 18:07 andy1213aa

@MthwRobinson do you have any updates on the matter?

Aug 19 '24 08:08 simonschoe

@simonschoe can you post a stack trace?

Aug 19 '24 17:08 scanny

import base64
import io

from unstructured.partition.pdf import partition_pdf

with open("testfile_with_images.pdf", 'rb') as f:
    base64str = base64.b64encode(f.read()).decode('utf-8')

file_bytes = base64.b64decode(base64str)
file_bytes = io.BytesIO(file_bytes)

doc_elements = partition_pdf(
    file=file_bytes,
    #filename="testfile_with_images.pdf",
    languages=['deu'],
    strategy="hi_res", 
    hi_res_model_name="yolox",
)

---------------------------------------------------------------------------
[shortened]

File ~\pdf2image\pdf2image.py:127, in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    124 if isinstance(poppler_path, PurePath):
    125     poppler_path = poppler_path.as_posix()
--> 127 page_count = pdfinfo_from_path(
    128     pdf_path, userpw, ownerpw, poppler_path=poppler_path
    129 )["Pages"]
    131 # We start by getting the output format, the buffer processing function and if we need pdftocairo
    132 parsed_fmt, final_extension, parse_buffer_func, use_pdfcairo_format = _parse_format(
    133     fmt, grayscale
    134 )

File ~\pdf2image\pdf2image.py:611, in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    607     raise PDFInfoNotInstalledError(
    608         "Unable to get page count. Is poppler installed and in PATH?"
    609     )
    610 except ValueError:
--> 611     raise PDFPageCountError(
    612         f"Unable to get page count.\n{err.decode('utf8', 'ignore')}"
    613     )

PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\...\Temp\tmpzorb9m7z': No error.

The issue does not occur if I load the file using the filename arg though.
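Since the filename argument works, in-memory bytes can be routed through a temporary file as a stopgap. The sketch below assumes the failure comes from another program being unable to open a temporary file that Python still holds open, which is how NamedTemporaryFile behaves on Windows; that cause is a guess based on the error message. The helper names temp_pdf_path and partition_bytes are hypothetical and not part of unstructured.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_pdf_path(data: bytes):
    """Write `data` to a closed temporary .pdf file and yield its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # close before use so other programs can open the path
        yield tmp.name
    finally:
        tmp.close()  # no-op if the file was already closed
        os.remove(tmp.name)


def partition_bytes(data: bytes, partition, **kwargs):
    """Call `partition(filename=<temp path>, **kwargs)` for in-memory PDF bytes."""
    with temp_pdf_path(data) as path:
        return partition(filename=path, **kwargs)
```

With unstructured installed, a call might look like partition_bytes(pdf_bytes, partition_pdf, strategy="hi_res", languages=["deu"]), where pdf_bytes is the raw bytes rather than the io.BytesIO wrapper. The temporary file is removed once partitioning returns.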

Aug 19 '24 18:08 simonschoe

@simonschoe thanks for this :)

Okay, this looks like a bug that has been fixed on main but not released yet. https://github.com/Unstructured-IO/unstructured-inference/commit/7804e0d7125dcfd8923fe7bbb1d255bfb1a96335

Can you try installing unstructured-inference from the main branch on GitHub? I think that's going to solve the problem. Something like this IIRC:

$ pip install -U "unstructured-inference @ git+https://github.com/Unstructured-IO/unstructured-inference"

I'll see about moving along a release.

Aug 19 '24 19:08 scanny

Thanks for the feedback! Unfortunately, I have to stick to a stable release version. I will look out for the upcoming unstructured-inference release.

Aug 21 '24 18:08 simonschoe

@simonschoe I ran into this issue because some dependencies were not installed. The unstructured version I used is 0.15.9. Installing the following fixed it for me:

sudo apt-get install poppler-utils # recommended by https://stackoverflow.com/questions/53481088/poppler-in-path-for-pdf2image
sudo apt install tesseract-ocr # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
sudo apt install libtesseract-dev # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
pip install tesseract # recommended by https://stackoverflow.com/a/52231794
pip install tesseract-ocr # recommended by https://stackoverflow.com/a/52231794

I also found the command sudo apt-get install libleptonica-dev tesseract-ocr tesseract-ocr-dev libtesseract-dev python3-pil tesseract-ocr-eng tesseract-ocr-script-latn, but I think it is outdated, since it fails with E: Package 'tesseract-ocr-dev' has no installation candidate. If you still get PDFPageCountError: Unable to get page count. after running the commands above, you could try that older command as well.
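Before reinstalling anything, it may help to check whether the command-line tools are actually visible on PATH. This is a minimal sketch: the function name missing_tools is made up for illustration, and the tool list is an assumption (pdfinfo and pdftoppm ship with poppler, tesseract with Tesseract OCR).

```python
import shutil


def missing_tools(tools=("pdfinfo", "pdftoppm", "tesseract")):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found on PATH.")
```

An empty result only shows that the executables are discoverable; it does not rule out the Windows temporary-file problem discussed earlier in the thread.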

Sep 10 '24 09:09 HuangBugWei

Fixed by #3395.

Dec 16 '24 21:12 scanny