bug/PIL.UnidentifiedImageError: cannot identify image file

Open udit-pandey-1 opened this issue 1 year ago • 15 comments

Describe the bug: I am getting the following error when extracting text and images from a PDF: PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpjy0tjjjd/2c2e244f-8f8e-46de-a7bc-2ecfbaa254ea-566.ppm' [image]

To Reproduce: The way I am using unstructured is: [image]

Expected behavior: Ideally, all the images in the PDF should be extracted. If there is a failure, image extraction should not abort the whole document (in my case, the PDF has 800 pages and it fails after going through 600 pages). For the layouts where image extraction failed, a flag in the metadata could convey that image extraction failed and give the reason. We should be able to get elements even when failures occur, through a flag that is passed when calling partition().

Environment Info: [image]

Any kind of quickfix to get elements even in case of failure would also be appreciated.
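
A possible stopgap, sketched under assumptions rather than as a supported feature: if the large PDF is first split into smaller files (for example one per page range), each piece can be partitioned on its own so that one bad piece does not discard the rest. Here `partition_fn` stands for whichever callable is in use (such as `partition` or `partition_pdf`); `PieceResult` and `partition_pieces` are hypothetical names, not part of unstructured.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class PieceResult:
    """Outcome of partitioning one piece of a larger document."""

    path: str
    elements: List[Any] = field(default_factory=list)
    error: Optional[str] = None  # reason for the failure, None on success


def partition_pieces(
    paths: Iterable[str],
    partition_fn: Callable[..., Iterable[Any]],
    **kwargs: Any,
) -> List[PieceResult]:
    """Partition each piece independently, recording failures instead of raising."""
    results: List[PieceResult] = []
    for path in paths:
        try:
            elements = list(partition_fn(filename=path, **kwargs))
            results.append(PieceResult(path=path, elements=elements))
        except Exception as exc:  # deliberately broad: keep going on any failure
            reason = f"{type(exc).__name__}: {exc}"
            results.append(PieceResult(path=path, error=reason))
    return results
```

The pieces whose `error` is not None show which page ranges failed and why, while the elements from every other piece remain available.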

udit-pandey-1 avatar May 26 '24 10:05 udit-pandey-1

Hi @udit-pandey-1 - could you provide a URL that we could use to reproduce? I'd also give our SaaS API a try. Our unstructured-python-client library will split the PDF up and distribute it across multiple workers, which should give you faster processing times.

MthwRobinson avatar May 28 '24 12:05 MthwRobinson

Hi @MthwRobinson, I got the above error in this file: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf Appreciate your efforts.

vegetableman avatar May 30 '24 02:05 vegetableman

Hi @vegetableman, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.

$ pip install unstructured -U
$ pip install unstructured-inference -U

from unstructured.partition.auto import partition

elements = partition(
    url="https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)
print("\n\n".join([str(el) for el in elements]))

christinestraub avatar May 30 '24 13:05 christinestraub

The latest versions worked for me :+1:... I was using the specific versions mentioned here: https://github.com/Unstructured-IO/unstructured/issues/2566#issuecomment-1982063333 Thank you, Christine!

However, unless I am mistaken, partition_pdf does not support loading PDF files through a url parameter. I had to use the filename parameter instead.

vegetableman avatar May 30 '24 16:05 vegetableman

Yes, as of now, partition_pdf does not support loading pdf files through a url parameter. Do we plan to do this? @MthwRobinson

christinestraub avatar May 30 '24 17:05 christinestraub

We don't plan to add that in partition_pdf as of now, though I believe that works in partition and will detect the MIME type from the HTTP response.

MthwRobinson avatar May 30 '24 17:05 MthwRobinson

@MthwRobinson that worked :+1: . My bad. Missed the module auto. Thank you!

vegetableman avatar May 31 '24 03:05 vegetableman

@christinestraub the issue is still occurring for me after upgrading the mentioned packages.

We are seeing this issue on Ubuntu 20.04.

udit-pandey-1 avatar May 31 '24 04:05 udit-pandey-1

here is a reference pdf file for it: https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf

udit-pandey-1 avatar May 31 '24 07:05 udit-pandey-1

@udit-pandey-1, I tried to partition the reference PDF file on both macOS and Ubuntu (22.04). It worked as expected and I couldn't reproduce the error. Can you please try again?

Environment:

unstructured==0.14.6
unstructured-inference==0.7.35

Code:

from unstructured.partition.auto import partition

elements = partition(
    url="https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)

print("\n\n".join([str(el) for el in elements]))

christinestraub avatar Jun 19 '24 19:06 christinestraub

still the same @christinestraub

unstructured==0.14.6
unstructured-inference==0.7.36
image

udit-pandey-1 avatar Jun 28 '24 10:06 udit-pandey-1

@udit-pandey-1 can you confirm that you have installed the following system dependencies?

  • libmagic-dev (filetype detection)
  • poppler-utils (images and PDFs)
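
A quick way to confirm that the command-line tools are visible to the Python process is sketched below. It assumes the usual executable names shipped by poppler-utils (`pdftoppm`, `pdftocairo`, `pdfinfo`); `find_executables` is a hypothetical helper. libmagic is a shared library rather than an executable, so this check does not cover it.

```python
import shutil
from typing import Dict, Optional, Sequence


def find_executables(
    names: Sequence[str] = ("pdftoppm", "pdftocairo", "pdfinfo"),
) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_executables().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Running this inside the same container or virtual environment that runs the partitioning code matters, since PATH can differ between a shell and a notebook or service.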

christinestraub avatar Jun 28 '24 16:06 christinestraub

libmagic-dev wasn't there. I installed it and then used the same code as above. It still failed with the same error.

udit-pandey-1 avatar Jul 01 '24 10:07 udit-pandey-1

Has there been any progress on this issue? I am facing the same problem, even after trying everything suggested above.

sanyamjain0315 avatar Aug 12 '24 14:08 sanyamjain0315

Hi there, I'm having the same issue on Python 3.10.12:

unstructured                     0.14.6
unstructured-client              0.25.6
unstructured-inference           0.7.35
unstructured.pytesseract         0.3.13

Unfortunately I can't share the documents as they contain proprietary information.

This is happening for every PDF in a folder of 50. All were generated from HTML files by opening them in Chrome and saving as PDF.

Stacktrace:

---------------------------------------------------------------------------
UnidentifiedImageError                    Traceback (most recent call last)
[<ipython-input-21-a26b75af5795>](https://localhost:8080/#) in <cell line: 4>()
      4 for k in data.keys():
      5   fpath = f"/path/to/file/{k}"
----> 6   els = partition_pdf(filename=fpath, 
      7                       max_partition=1500,
      8                       chunking_strategy='by_title',

10 frames
[/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    603             unique_element_ids: bool = call_args.get("unique_element_ids", False)
    604             if unique_element_ids is False:
--> 605                 elements = assign_and_map_hash_ids(elements)
    606 
    607             return elements

/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)

[/usr/local/lib/python3.10/dist-packages/unstructured/chunking/dispatch.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
     72 
     73         # -- call the partitioning function to get the elements --
---> 74         elements = func(*args, **kwargs)
     75 
     76         # -- look for a chunking-strategy argument --

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    208         form_extraction_skip_tables=form_extraction_skip_tables,
    209         **kwargs,
--> 210     )
    211 
    212 

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    344     if isinstance(file, bytes):
    345         file = io.BytesIO(file)
--> 346     return _partition_pdf_with_pdfminer(
    347         filename=filename,
    348         file=file,

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in _partition_pdf_or_image_with_ocr(filename, file, include_page_breaks, languages, ocr_languages, is_image, metadata_last_modified, starting_page_number, **kwargs)
    894             tmp_element = element
    895             tmp_text = element.text
--> 896             tmp_coords = element.metadata.coordinates
    897         elif tmp_element and check_coords_within_boundary(
    898             coordinates=element.metadata.coordinates,

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf_image/pdf_image_utils.py](https://localhost:8080/#) in convert_pdf_to_images(filename, file, chunk_size)
    414     date_from_file_object: bool = False,
    415 ) -> str | None:
--> 416     last_modification_date = None
    417     if not file and filename:
    418         last_modification_date = get_last_modified_date(filename=filename)

[/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py](https://localhost:8080/#) in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    267                 )
    268             else:
--> 269                 images += parse_buffer_func(data)
    270     finally:
    271         if auto_temp_dir:

[/usr/local/lib/python3.10/dist-packages/pdf2image/parsers.py](https://localhost:8080/#) in parse_buffer_to_ppm(data)
     26         size_x, size_y = tuple(size.split(b" "))
     27         file_size = len(code) + len(size) + len(rgb) + 3 + int(size_x) * int(size_y) * 3
---> 28         images.append(Image.open(BytesIO(data[index : index + file_size])))
     29         index += file_size
     30 

[/usr/local/lib/python3.10/dist-packages/PIL/Image.py](https://localhost:8080/#) in open(fp, mode, formats)
   3281             raise TypeError(msg) from e
   3282     else:
-> 3283         rawmode = mode
   3284     if mode in ["1", "L", "I", "P", "F"]:
   3285         ndmax = 2

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7e086492d030>

tpakeman avatar Sep 03 '24 15:09 tpakeman

Same error: certain files with a .ppm extension throw UnidentifiedImageError with the hi_res strategy.

Packages

python 3.12
unstructured==0.16.10
unstructured-client==0.28.1
unstructured-inference==0.8.1
pytesseract==0.3.13
pillow==11.0.0
unstructured.pytesseract==0.3.13

sidatcd avatar Dec 11 '24 01:12 sidatcd

Closing as inactive. Cannot reproduce, assumed resolved. If you're still seeing this and can provide a file that reproduces the error I'll take another look.

scanny avatar Dec 16 '24 21:12 scanny

Same error on Python 3.12.

Packages

unstructured==0.16.10
unstructured-client==0.28.1
unstructured-inference==0.8.1
pytesseract==0.3.13
pillow==11.0.0
unstructured.pytesseract==0.3.13

CODE:

from PIL import Image as PILImage
from PIL import ImageFile
from unstructured.partition.pdf import partition_pdf

# Let Pillow load images whose data is cut short instead of raising.
ImageFile.LOAD_TRUNCATED_IMAGES = True

partitions = partition_pdf(
    url=None,
    filename=filename,
    strategy="hi_res",
    extract_images_in_pdf=True,
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,
    max_partition=None,
    unique_element_ids=True,
    extract_image_block_output_dir="/tmp",  # Temporary directory to store images
)

Not for all files, but for at least 30% of them, I get the same error message: PIL.UnidentifiedImageError: cannot identify image file <temporary ppm file>

I can't share the files as they contain confidential data.

sidatcd avatar Dec 17 '24 03:12 sidatcd

Okay, good at least you are able to still reproduce it. I have an idea where to look.

scanny avatar Dec 17 '24 04:12 scanny

@sidatcd can you provide me a fresh stack-trace? I can't make any sense of the one earlier in the thread, possibly because of its age.

Also, do you have reason to believe that the problematic PDF files on your side contain PPM images? That is a pretty old format, 1980s era, but it seems to be the format it is complaining about.

scanny avatar Dec 17 '24 05:12 scanny

@scanny Ha, my initial thought was the same. But I saw the same error on fairly recent PDFs as well.

relevant trace

line 48, in partition_document
    partitions = partition_pdf(
  File "/var/task/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/file_utils/filetype.py", line 725, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/file_utils/filetype.py", line 683, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/partition/pdf.py", line 209, in partition_pdf
    return partition_pdf_or_image(
  File "/var/task/unstructured/partition/pdf.py", line 305, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/var/task/unstructured/utils.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "/var/task/unstructured/partition/pdf.py", line 588, in _partition_pdf_or_image_local
    inferred_document_layout = process_file_with_model(
  File "/var/task/unstructured_inference/inference/layout.py", line 376, in process_file_with_model
    else DocumentLayout.from_file(
  File "/var/task/unstructured_inference/inference/layout.py", line 74, in from_file
    with Image.open(image_path) as image:
  File "/var/task/PIL/Image.py", line 3536, in open
    raise UnidentifiedImageError(msg)

Personally I would be happy if ppm files are not identified.

sidatcd avatar Dec 17 '24 06:12 sidatcd

Okay, looks like PPMs are coming from pdftoppm (part of poppler) as part of the process, so that explains the ppm bit anyway.

scanny avatar Dec 17 '24 06:12 scanny

@sidatcd Unfortunately I am unable to reproduce this with the PDF earlier in the thread. I'll have to close it for now because it's not actionable. If you are able to find a shareable document that produces the error we can reopen and I'll have another look.

Maybe you can narrow it down somehow, like maybe capturing the file that's causing the error, perhaps by printing out the path and copying the offending file to a new location where you can inspect it and/or post it, probably around this location in your local install: https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L74

Knowing the type of that file, its size, whether it can be inspected or whether it's possibly corrupted or something, all those could be useful hints. Also whether it happens late in the file (when more memory has been consumed) or earlier.
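
One way to gather those hints for a captured page image is sketched below. It assumes the file is a binary PPM ("P6"), the variant pdftoppm normally writes for colour pages; `inspect_ppm` is a hypothetical helper. It reads the header and compares the size the header implies with the size actually on disk, which would reveal a truncated file.

```python
import os
from typing import Any, Dict


def inspect_ppm(path: str) -> Dict[str, Any]:
    """Read a binary PPM (P6) header and compare expected vs. actual file size."""
    actual_size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(512)  # the header is tiny; 512 bytes is plenty

    tokens = []  # magic, width, height, maxval
    pos = 0
    while len(tokens) < 4 and pos < len(head):
        ch = head[pos:pos + 1]
        if ch.isspace():
            pos += 1
        elif ch == b"#":  # a comment runs to the end of the line
            while pos < len(head) and head[pos:pos + 1] != b"\n":
                pos += 1
        else:
            start = pos
            while pos < len(head) and not head[pos:pos + 1].isspace():
                pos += 1
            tokens.append(head[start:pos])

    report: Dict[str, Any] = {"valid_header": False, "actual_size": actual_size}
    if len(tokens) < 4 or tokens[0] != b"P6":
        return report
    try:
        width, height, maxval = (int(t) for t in tokens[1:])
    except ValueError:
        return report

    bytes_per_sample = 1 if maxval < 256 else 2
    header_size = pos + 1  # exactly one whitespace byte follows maxval
    expected_size = header_size + width * height * 3 * bytes_per_sample
    report.update(
        valid_header=True,
        width=width,
        height=height,
        maxval=maxval,
        expected_size=expected_size,
        truncated=actual_size < expected_size,
    )
    return report
```

A report with `valid_header` False (for instance on an empty file) or `truncated` True would point towards the renderer writing incomplete output rather than towards Pillow.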

Another idea is catching the exception at that location and just skipping the file and seeing what happens. It looks like that would skip whole pages, but which pages get skipped could also be interesting insight.
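
A minimal sketch of that idea follows, assuming the open call is wrapped locally for debugging only. `open_or_quarantine` is a hypothetical helper and `open_fn` stands for whatever opens the page image (with Pillow, `PIL.Image.open`). Because the page images live in a temporary directory that is cleaned up afterwards, the offending file is copied somewhere durable before moving on.

```python
import shutil
from pathlib import Path
from typing import Any, Callable, Optional


def open_or_quarantine(
    image_path: str,
    open_fn: Callable[[str], Any],
    quarantine_dir: str,
) -> Optional[Any]:
    """Try to open a page image; on failure keep a copy for inspection and return None."""
    try:
        return open_fn(image_path)
    except Exception as exc:  # broad on purpose: this is a debugging aid
        dest = Path(quarantine_dir)
        dest.mkdir(parents=True, exist_ok=True)
        src = Path(image_path)
        if src.exists():
            shutil.copy2(src, dest / src.name)  # preserve the file before cleanup
        print(f"skipped {image_path}: {type(exc).__name__}: {exc}")
        return None
```

A caller would treat a `None` result as "skip this page", and the copies in `quarantine_dir` show which pages were affected.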

Also, if the machine you're running on is memory constrained and perhaps the files where this happens contain many or very big images, the PPM image format is not compressed, so it does potentially consume a lot of memory. If you can check it on another machine with a different amount of memory and see if it gets better or gets worse, that would also be an interesting observation.
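
To put a rough number on that, an uncompressed 8-bit RGB raster takes 3 bytes per pixel. The sketch below assumes US Letter pages rendered at 200 dpi purely for illustration (the resolution actually used may differ), and `ppm_page_bytes` is a hypothetical helper.

```python
def ppm_page_bytes(width_in: float, height_in: float, dpi: int = 200) -> int:
    """Approximate size of one 8-bit RGB page raster: 3 bytes per pixel, no compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3


if __name__ == "__main__":
    per_page = ppm_page_bytes(8.5, 11)  # US Letter at the assumed 200 dpi
    total = 800 * per_page  # if all 800 pages were held at once
    print(f"one page: {per_page / 1e6:.1f} MB, 800 pages: {total / 1e9:.1f} GB")
```

Under these assumptions a single page is about 11 MB, so a document with hundreds of pages can reach several gigabytes of raster data if many pages are kept on disk or in memory at the same time.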

scanny avatar Dec 17 '24 19:12 scanny

@scanny Is there a way not to extract selected image formats?

sidatcd avatar Dec 17 '24 19:12 sidatcd

One thing I noticed was that I couldn't replicate this on Mac, only on Linux containers or custom Python containers. Could it be a specific version of poppler-utils?

sidatcd avatar Dec 17 '24 19:12 sidatcd

@sidatcd Regarding the images, that threw me at first too. But what's happening in this step is the entire PDF document is being rendered to a series of "page" images in preparation for "vision" processing by the layout/object-detection model; not "extracting" embedded images per se.

poppler is being used for this job and possibly because of when it was originally written, it uses the now-uncommon PPM format for rendering those pages. PPM does have the advantage that it is uncompressed (so faster because no expensive compression). And, it turns out, it is supported by Pillow (PIL). In any case, all those page images are going to be in PPM format so we can't just filter out PPMs.

The code that does this page rendering is here, and it uses pdf2image (a Python wrapper around poppler's command-line tools) for the job: https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L400


Regarding the Linux/Mac discrepancy:

  • That could be why I can't reproduce it, because I only have a Mac handy.
  • Versions are absolutely worth checking: poppler-utils, pdf2image (Python package), and Pillow (PIL) in particular.
  • Definitely check differences in available memory. I've seen mention that poppler may simply fail to render, without raising an error, if it runs out of memory. It could be running out of memory mid-page and writing a truncated (and thereby corrupted) PPM file.
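
The version comparison could be collected with something like the sketch below, run on both the Mac and the Linux container. `installed_versions` and `poppler_version` are hypothetical helpers; the second one assumes `pdftoppm -v` reports its version on stderr or stdout.

```python
import shutil
import subprocess
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(
    packages: Sequence[str] = ("unstructured", "unstructured-inference", "pdf2image", "pillow"),
) -> Dict[str, Optional[str]]:
    """Return the installed version of each distribution, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def poppler_version() -> Optional[str]:
    """Return the first line printed by `pdftoppm -v`, or None if pdftoppm is not found."""
    exe = shutil.which("pdftoppm")
    if exe is None:
        return None
    proc = subprocess.run([exe, "-v"], capture_output=True, text=True)
    lines = (proc.stderr or proc.stdout).strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(installed_versions())
    print(poppler_version())
```

Posting the two outputs side by side would show quickly whether the failing environment differs in poppler, pdf2image, or Pillow.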

You could also try running the pdftoppm command-line program (part of poppler) on Linux against a problematic file and see what you get, possibly check hashes against what is produced on the Mac for the same file. I'd say that's definitely a good avenue to pursue.
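
A sketch of that comparison, assuming `pdftoppm` is on PATH and using its `-r` (resolution) option; `render_pages` and `hash_pages` are hypothetical helpers, and 200 dpi is only an illustrative value.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Dict


def render_pages(pdf_path: str, out_dir: str, dpi: int = 200) -> None:
    """Render every page of the PDF into out_dir as page-<n>.ppm using pdftoppm."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdftoppm", "-r", str(dpi), pdf_path, str(Path(out_dir) / "page")],
        check=True,  # raise if pdftoppm exits with a non-zero status
    )


def hash_pages(out_dir: str) -> Dict[str, str]:
    """Return {file name: SHA-256 hex digest} for every rendered PPM, sorted by name."""
    digests: Dict[str, str] = {}
    for path in sorted(Path(out_dir).glob("*.ppm")):
        digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests
```

Calling `render_pages` and then `hash_pages` on each machine gives two dictionaries to compare. Differing hashes alone are not proof of corruption, since different poppler versions may render slightly differently, so it is worth checking the file sizes of any mismatched pages as well.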

scanny avatar Dec 17 '24 20:12 scanny