open-parse PIL.UnidentifiedImageError

Initial Checks

[X] I confirm that I'm on the latest version

Description

I've run into issues parsing some PDFs from the US House. For example:

https://aderholt.house.gov/sites/evo-subsites/aderholt.house.gov/files/evo-media-document/aderholt-challenger-center-disclosure-ltr-updated.pdf

Running the example code below gives this traceback:

Traceback (most recent call last):
  File "/home/travis/problem.py", line 5, in <module>
    pdf_parser.parse(f)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/doc_parser.py", line 111, in parse
    nodes = self.processing_pipeline.run(nodes)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/ingest.py", line 42, in run
    nodes = transform_func.process(sorted(nodes))
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/basic_transforms.py", line 115, in process
    combined_image = self._combine_images_in_group(image_elements)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/basic_transforms.py", line 47, in _combine_images_in_group
    image = Image.open(io.BytesIO(image_data))
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/PIL/Image.py", line 3536, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x77abf494f420>

Example Code

import openparse

f = "aderholt-challenger-center-disclosure-ltr-updated.pdf"
pdf_parser = openparse.DocumentParser()
pdf_parser.parse(f)
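To find out which files in a larger batch trigger the error without stopping at the first failure, the parse call can be wrapped along these lines. This is only a sketch: `parse_many` is a hypothetical helper, not part of open-parse, and it catches any exception rather than `PIL.UnidentifiedImageError` specifically so that it has no third-party imports.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def parse_many(
    paths: Iterable[str],
    parse_fn: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Parse each path with parse_fn, collecting failures instead of crashing.

    parse_many is a hypothetical helper written for this issue; parse_fn is
    whatever callable does the parsing (for example a parser's parse method).
    """
    parsed: Dict[str, Any] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            parsed[path] = parse_fn(path)
        except Exception as exc:  # broad on purpose: we only record the failure
            failed[path] = f"{type(exc).__name__}: {exc}"
    return parsed, failed
```

With the parser from the example above, this would be called as `parsed, failed = parse_many(pdf_paths, pdf_parser.parse)`, where `pdf_paths` is your own list of files.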

Python, open-parse & OS Version

python_version: 3.10.1
             operating_system: Linux
                   os_version: 6.8.0-49-generic
           open-parse version: 0.7.0
                 install path: /home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse
               python version: 3.10.1 (main, May 23 2024, 14:57:20) [GCC 9.4.0]
                     platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
             related packages: PyMuPDF-1.24.14 pydantic-2.10.1

Since the error is raised by PIL: the installed version is pillow==11.0.0

Dec 03 '24 22:12 thoppe

Not a solution

This bug does not exist in 0.6.1. Consider downgrading to that version if it's an urgent matter.

Dec 18 '24 22:12 tinosai

> Not a solution
>
> This bug does not exist in 0.6.1. Consider downgrading to that version if it's an urgent matter.

This error still occurs after I roll back to 0.6.1.

Jan 05 '25 05:01 yueqingliang1

We reproduced this error setting up Open-Parse today as well. Love the project though! 😀😍

Jan 25 '25 08:01 mikeumus

@thoppe @yueqingliang1 if you don't need to process images, you can get past the error with this fork. Its README has been updated to show how to turn off image processing:

https://github.com/DivinciAI/open-parse/tree/main

Jan 29 '25 02:01 mikeumus

Hi there, I ran into the same bug. It crashes on around 19 PDF files out of 100. As a hotfix, I replaced the BasicIngestionPipeline with a version that omits the CombineSlicedImages step. As suggested by @mikeumus, an option to disable image processing would be a great feature :) Great project though 🤗
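A rough sketch of that kind of hotfix, for anyone who wants to try it. It assumes that open-parse exposes `processing.BasicIngestionPipeline` with a `transformations` list and that `DocumentParser` accepts a `processing_pipeline` argument; please check both against your installed version. `without_step` and `build_parser_without_image_combining` are hypothetical helper names, not part of the library.

```python
from typing import Any, List


def without_step(transformations: List[Any], step_name: str) -> List[Any]:
    """Return the transforms whose class name is not step_name."""
    return [t for t in transformations if type(t).__name__ != step_name]


def build_parser_without_image_combining():
    """Build a parser whose pipeline skips CombineSlicedImages.

    Assumption: BasicIngestionPipeline stores its steps in a
    `transformations` list and DocumentParser takes `processing_pipeline`.
    """
    # Imported here so the helper above stays usable without open-parse.
    import openparse
    from openparse import processing

    pipeline = processing.BasicIngestionPipeline()
    pipeline.transformations = without_step(
        pipeline.transformations, "CombineSlicedImages"
    )
    return openparse.DocumentParser(processing_pipeline=pipeline)
```

Calling `build_parser_without_image_combining().parse(f)` would then run every default step except the image combining one, so sliced images are left as separate elements rather than merged.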

Feb 04 '25 11:02 MathieuCiancone

> Hi there, I ran into the same bug. It crashes on around 19 PDF files out of 100. As a hotfix, I replaced the BasicIngestionPipeline with a version that omits the CombineSlicedImages step. As suggested by @mikeumus, an option to disable image processing would be a great feature :) Great project though 🤗

How did you avoid the "CombineSlicedImages" step?

Apr 11 '25 06:04 kaushalWa