PIL.UnidentifiedImageError
Initial Checks
- [X] I confirm that I'm on the latest version
Description
I've run into issues parsing some PDFs from the US House. For example:
https://aderholt.house.gov/sites/evo-subsites/aderholt.house.gov/files/evo-media-document/aderholt-challenger-center-disclosure-ltr-updated.pdf
Running the example code below gives this traceback:
Traceback (most recent call last):
  File "/home/travis/problem.py", line 5, in <module>
    pdf_parser.parse(f)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/doc_parser.py", line 111, in parse
    nodes = self.processing_pipeline.run(nodes)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/ingest.py", line 42, in run
    nodes = transform_func.process(sorted(nodes))
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/basic_transforms.py", line 115, in process
    combined_image = self._combine_images_in_group(image_elements)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/basic_transforms.py", line 47, in _combine_images_in_group
    image = Image.open(io.BytesIO(image_data))
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/PIL/Image.py", line 3536, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x77abf494f420>
Example Code
import openparse
f = "aderholt-challenger-center-disclosure-ltr-updated.pdf"
pdf_parser = openparse.DocumentParser()
pdf_parser.parse(f)
Python, open-parse & OS Version
python_version: 3.10.1
operating_system: Linux
os_version: 6.8.0-49-generic
open-parse version: 0.7.0
install path: /home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse
python version: 3.10.1 (main, May 23 2024, 14:57:20) [GCC 9.4.0]
platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
related packages: PyMuPDF-1.24.14 pydantic-2.10.1
Since the error comes from PIL: pillow==11.0.0
Not a solution
This bug does not exist in 0.6.1. Consider downgrading to that version if it's an urgent matter.
This error still occurs after I downgrade to 0.6.1.
We reproduced this error setting up Open-Parse today as well. Love the project though! 😀😍
@thoppe @yueqingliang1 if you don't need to process images, you can get past the error with this fork. Its README has been updated to show how to turn off image processing:
- https://github.com/DivinciAI/open-parse/tree/main
Hi there,
I ran into the same bug. It crashes on around 19 PDF files out of 100. As a hotfix, I replaced the BasicIngestionPipeline with a pipeline that omits the CombineSlicedImages step. As suggested by @mikeumus, being able to disable image processing would be a great feature :)
Great project though 🤗
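For anyone who wants to measure how many files in a batch are affected, here is a minimal sketch of a triage loop. The helper name `triage` is hypothetical, and it takes the parse function as an argument, so it makes no assumptions about open-parse itself.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def triage(paths: Iterable, parse: Callable) -> Tuple[List[str], Dict[str, str]]:
    """Call parse(path) for each path and record which ones raise."""
    ok: List[str] = []
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            parse(path)
        except Exception as exc:
            # Keep the exception type and message so failures can be grouped later.
            failed[str(path)] = f"{type(exc).__name__}: {exc}"
        else:
            ok.append(str(path))
    return ok, failed
```

It could be called as `ok, failed = triage(Path("pdfs").glob("*.pdf"), openparse.DocumentParser().parse)`, where `pdfs` is a placeholder directory name.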
How did you avoid the CombineSlicedImages step?
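One possible way to skip that step, as a sketch rather than the exact hotfix described above: build the default pipeline and filter the step out before handing it to the parser. This assumes the pipeline stores its steps in a list attribute named `transformations` and that `DocumentParser` accepts a `processing_pipeline` keyword argument. Both are worth checking against your installed version. The helper names `drop_step` and `build_parser_without_image_combining` are made up for this example. Matching on the class name avoids assuming that `CombineSlicedImages` is importable from `openparse.processing`.

```python
def drop_step(pipeline, step_name):
    """Remove every step whose class name equals step_name, then return the pipeline."""
    # Assumption: the pipeline keeps its steps in a list attribute named `transformations`.
    pipeline.transformations = [
        step for step in pipeline.transformations
        if type(step).__name__ != step_name
    ]
    return pipeline


def build_parser_without_image_combining():
    """Build a DocumentParser whose pipeline skips CombineSlicedImages."""
    # Imported lazily so drop_step can be used without open-parse installed.
    import openparse
    from openparse import processing

    pipeline = drop_step(processing.BasicIngestionPipeline(), "CombineSlicedImages")
    # Assumption: DocumentParser accepts a `processing_pipeline` keyword argument.
    return openparse.DocumentParser(processing_pipeline=pipeline)
```

With that in place, `build_parser_without_image_combining().parse("aderholt-challenger-center-disclosure-ltr-updated.pdf")` should no longer reach the `Image.open` call from the traceback, at the cost of sliced images not being merged.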