Unable to load file
Maybe related to this. When partition_pdf is given a file object opened in binary mode, an error is thrown.
Example:
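from unstructured.partition.pdf import partition_pdf

# Pass an open binary file object instead of a filename.
with open("./that.pdf", 'rb') as f:
    elements = partition_pdf(
        file=f,
        strategy='hi_res',
        is_image=False,
        include_page_breaks=True,
        analysis=True,
        infer_table_structure=True,
    )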
Error:
PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\LavoriV\AppData\Local\Temp\tmp_fqqu798': No error.
Hi @vlavorini - could you let us know what versions and operating system you're using, and share an example file we could use to reproduce?
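For anyone else reporting this, a small sketch like the one below could collect the requested details in one go. The function name `environment_report` and the default package list are only illustrative assumptions, not part of unstructured.

```python
import platform
from importlib import metadata


def environment_report(packages=("unstructured", "unstructured-inference", "pdf2image")):
    """Return a short text report with OS, Python and package versions.

    The default package names are an assumption about what is relevant here.
    """
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {platform.python_version()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


# Usage: print(environment_report()) and paste the output into the issue.
```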
I'm on Windows, with Unstructured version 0.14.2. Here is the file I use: population in EU.pdf
Thanks, @vlavorini !
I also encountered this issue. Is there any progress or update on this matter?
@andy1213aa can you post the code you used and also mention whether you are on Windows?
I used the same code provided above by @vlavorini, and yes, I am also running it on Windows. Is there currently a way to solve this problem on Windows? Thank you!
@MthwRobinson do you have any updates on the matter?
@simonschoe can you post a stack trace?
import base64
import io

from unstructured.partition.pdf import partition_pdf

# Round-trip the PDF through base64 and wrap the bytes in an in-memory buffer.
with open("testfile_with_images.pdf", 'rb') as f:
    base64str = base64.b64encode(f.read()).decode('utf-8')

file_bytes = base64.b64decode(base64str)
file_bytes = io.BytesIO(file_bytes)

doc_elements = partition_pdf(
    file=file_bytes,
    #filename="testfile_with_images.pdf",
    languages=['deu'],
    strategy="hi_res",
    hi_res_model_name="yolox",
)
---------------------------------------------------------------------------
[shortened]
File ~\pdf2image\pdf2image.py:127, in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    124 if isinstance(poppler_path, PurePath):
    125     poppler_path = poppler_path.as_posix()
--> 127 page_count = pdfinfo_from_path(
    128     pdf_path, userpw, ownerpw, poppler_path=poppler_path
    129 )["Pages"]
    131 # We start by getting the output format, the buffer processing function and if we need pdftocairo
    132 parsed_fmt, final_extension, parse_buffer_func, use_pdfcairo_format = _parse_format(
    133     fmt, grayscale
    134 )
File ~\pdf2image\pdf2image.py:611, in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    607     raise PDFInfoNotInstalledError(
    608         "Unable to get page count. Is poppler installed and in PATH?"
    609     )
    610 except ValueError:
--> 611     raise PDFPageCountError(
    612         f"Unable to get page count.\n{err.decode('utf8', 'ignore')}"
    613     )
PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\...\Temp\tmpzorb9m7z': No error.
The issue does not occur if I load the file using the filename arg though.
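Since passing filename works, one possible stopgap is to copy the in-memory bytes to a temporary file yourself, close it, and hand the path to partition_pdf. This is only a sketch: the helper name `as_closed_temp_file` is made up for illustration, and it assumes the failure comes from the temp file still being open when poppler tries to read it, which this thread has not confirmed.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_closed_temp_file(file_obj, suffix=".pdf"):
    """Copy a binary file-like object to a closed temp file and yield its path.

    Hypothetical helper, not part of unstructured. The temp file is closed
    before the path is handed out so another process can open it on Windows,
    and it is deleted once the with-block exits.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(file_obj.read())
        tmp.close()
        yield tmp.name
    finally:
        tmp.close()  # no-op if already closed
        os.remove(tmp.name)


# Usage (requires unstructured to be installed):
#
#   from unstructured.partition.pdf import partition_pdf
#
#   with as_closed_temp_file(file_bytes) as path:
#       doc_elements = partition_pdf(filename=path, strategy="hi_res")
```

The elements are fully built before the with-block exits, so removing the temp file afterwards should be safe.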
@simonschoe thanks for this :)
Okay, this looks like a bug that has been fixed on main but not released yet.
https://github.com/Unstructured-IO/unstructured-inference/commit/7804e0d7125dcfd8923fe7bbb1d255bfb1a96335
Can you try installing unstructured-inference from the main branch on GitHub? I think that's going to solve the problem. Something like this IIRC:
$ pip install -U "unstructured-inference @ git+https://github.com/Unstructured-IO/unstructured-inference"
I'll see about moving along a release.
Thanks for the feedback! Unfortunately, I have to stick to a stable release version. I will look out for the upcoming unstructured-inference release.
@simonschoe I've faced the issue due to some dependencies not being installed. The unstructured version I used is 0.15.9.
sudo apt-get install poppler-utils # recommended by https://stackoverflow.com/questions/53481088/poppler-in-path-for-pdf2image
sudo apt install tesseract-ocr # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
sudo apt install libtesseract-dev # recommended by https://tesseract-ocr.github.io/tessdoc/Installation.html
pip install tesseract # recommended by https://stackoverflow.com/a/52231794
pip install tesseract-ocr # recommended by https://stackoverflow.com/a/52231794
I've also found the command sudo apt-get install libleptonica-dev tesseract-ocr tesseract-ocr-dev libtesseract-dev python3-pil tesseract-ocr-eng tesseract-ocr-script-latn, but I think it is outdated, since it fails with the error E: Package 'tesseract-ocr-dev' has no installation candidate. You can still try it if you keep getting PDFPageCountError: Unable to get page count. after running the commands above.
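To see whether the system dependencies are actually visible to Python before re-running partition_pdf, a quick PATH check along these lines may help. The function name `missing_tools` is hypothetical, and the default list (pdfinfo and pdftoppm from poppler, plus tesseract) is an assumption about which executables matter for the hi_res strategy.

```python
import shutil


def missing_tools(tools=("pdfinfo", "pdftoppm", "tesseract")):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Usage: an empty list means every listed tool was found on PATH.
#   print(missing_tools())
```

Note that this only tells you the binaries are discoverable. It would not detect the temp-file problem reported on Windows earlier in this thread.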
Fixed by #3395.