pdfminer.six

Raising wrong exception when selecting PNG predictor

Open manuGil opened this issue 1 year ago • 0 comments

The decode function in pdftypes.py raises an inaccurate exception when the predictor is not supported: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/pdftypes.py#L378-L389

The code suggests that if a predictor is higher than 10, an exception with the message "Unsupported predictor:" should be raised. But the program's behaviour bypasses this and raises a ValueError with the message "Unsupported bitspercomponent" in apply_png_predictor.
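The control flow described above can be sketched in isolation. This is a simplified stand-in, not the pdfminer source: it assumes that predictor values of 10 and above take the PNG branch, and that the PNG helper rejects bit depths other than 1 and 8 before doing anything else. The names `decode_predictor_sketch` and `apply_png_predictor_sketch` are hypothetical, and `PDFNotImplementedError` is redefined locally so the snippet runs without pdfminer installed.

```python
class PDFNotImplementedError(Exception):
    """Local stand-in for pdfminer.pdftypes.PDFNotImplementedError."""


def apply_png_predictor_sketch(pred, colors, columns, bitspercomponent, data):
    # Assumption: the bit depth is validated first, so an unsupported depth
    # surfaces as a ValueError regardless of the predictor value.
    if bitspercomponent not in (1, 8):
        raise ValueError("Unsupported `bitspercomponent': %d" % bitspercomponent)
    return data  # the real helper would undo the PNG filtering here


def decode_predictor_sketch(params, data):
    pred = int(params.get("Predictor", 1))
    if pred == 1:
        return data  # no predictor applied
    if pred >= 10:
        # Assumed PNG branch: every value from 10 upwards is handed over.
        return apply_png_predictor_sketch(
            pred,
            int(params.get("Colors", 1)),
            int(params.get("Columns", 1)),
            int(params.get("BitsPerComponent", 8)),
            data,
        )
    # Only values that are neither 1 nor >= 10 reach this point.
    raise PDFNotImplementedError("Unsupported predictor: %r" % pred)
```

Under these assumptions, the DecodeParms from this report (`Predictor` 15, `BitsPerComponent` 4) end in the ValueError, and the "Unsupported predictor" message is only reachable for values between 2 and 9.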

Steps to reproduce:

  • Pass an LTImage object with at least the following stream and colorspace to `ImageWriter.export_image()`:
<PDFStream(13): raw=5846880, {'Subtype': /'Image', 'Length': 5846880, 'Filter': [/'FlateDecode'], 'SMask': <PDFObjRef:12>, 'BitsPerComponent': 8, 'ColorSpace': /'DeviceCMYK', 'Width': 2953, 'DecodeParms': [{'Columns': 2953, 'Predictor': 15, 'BitsPerComponent': 4, 'Colors': 4}], 'Height': 1205, 'Type': /'XObject', 'Decode': [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]}> 
# colorspace
[/'DeviceCMYK']
  • This is also reproducible when using `pdf2txt.py` on a PDF document that contains an image element with an unsupported predictor.

  • Script used when the bug was found. Note that the important part is writing an image element using the ImageWriter:
        pages = []
        for page in tqdm(pdf_pages, desc="Reading pages", unit="pages"):
            elements = sort_layout_elements(page, img_height=IMG_SETTINGS["width"], img_width=IMG_SETTINGS["height"])
            pages.append(elements)

        for page in tqdm(pages, desc="Extracting images", total=len(pages), unit="pages"):

            iw = ImageWriter(image_directory)
        
            for img in page["images"]:
            
                visual = Visual(document_page=page["page_number"], document=pdf_document, bbox=img.bbox)
                
                # Search for captions using proximity to image Bboxes
                # This might generate multiple matches
                bbox_matches =[]
                for _text in page["texts"]:
                    match = find_caption_by_bbox(img, _text, offset=CAP_SETTINGS["offset"], 
                                                direction=CAP_SETTINGS["direction"])
                    if match:
                        bbox_matches.append(match)
                # Search for captions using text analysis (keywords)
                # if more than one bbox matches are found
                if len(bbox_matches) == 0:
                    pass # don't set any caption
                elif len(bbox_matches) == 1:
                    caption = ""
                    for text_line in bbox_matches[0]:
                        caption += text_line.get_text().strip() 
                    visual.set_caption(caption)
                else: # more than one matches in bbox_matches
                    for _text in bbox_matches:
                        text_match = find_caption_by_text(_text, keywords=CAP_SETTINGS["keywords"])
                    if text_match:
                        caption = ""
                        for text_line in bbox_matches[0]:
                            caption += text_line.get_text().strip() 
                        # Set the caption to the first text match.
                        # All other matches will be ignored.
                        # This may introduce mistakes.
                        try:
                            visual.set_caption(caption)
                        except Warning: # ignore warnings when caption is already set.
                            pass
                        
                # rename image name to include page number
                img.name = "page" + str(page["page_number"]) + "-" + img.name
                # save image to file
            
                try:
                    print(img)
                    print(img.stream, img.colorspace)
                    image_file_name =iw.export_image(img) # returns image file name, 
                    # which last part is automatically generated by pdfminer to guarantee uniqueness
                except ValueError as e:
                    print(f"Image has unsupported bit depth? Image:{img.name}")
                    raise e
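
The script above re-raises the error, which stops the whole run at the first affected image. For callers who would rather skip such images, a minimal sketch of a wrapper follows. `export_or_skip` is a hypothetical helper, not part of pdfminer; it only assumes that `writer` has an `export_image(image)` method, as ImageWriter does.

```python
def export_or_skip(writer, image, skip_errors=(ValueError,)):
    """Return the exported file name, or None if exporting the image fails.

    `writer` is any object with an `export_image(image)` method.
    `skip_errors` lists the exception types treated as "cannot decode".
    """
    try:
        return writer.export_image(image)
    except skip_errors as exc:
        # Print the exception type as well as the message, because the type
        # raised may not be the one the caller expects.
        name = getattr(image, "name", image)
        print(f"Skipping {name!r}: {type(exc).__name__}: {exc}")
        return None
```

With pdfminer installed, passing `skip_errors=(ValueError, PDFNotImplementedError)`, with `PDFNotImplementedError` imported from `pdfminer.pdftypes`, would cover both of the exceptions discussed in this report.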

Expected Output:

Traceback (most recent call last):
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 172, in <module>
    main(str_id)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 134, in main
    image_file_name =iw.export_image(img) # returns image file name, 
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 126, in export_image
    name = self._save_bytes(image)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 224, in _save_bytes
    channels = len(image.stream.get_data()) / width / height / (image.bits / 8)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 398, in get_data
    self.decode()
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 391, in decode
    raise PDFNotImplementedError(error_msg)
pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 15

Actual Output:

  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 132, in main
    image_file_name =iw.export_image(img) # returns image file name, 
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 126, in export_image
    name = self._save_bytes(image)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 224, in _save_bytes
    channels = len(image.stream.get_data()) / width / height / (image.bits / 8)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 398, in get_data
    self.decode()
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 386, in decode
    data = apply_png_predictor(
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/utils.py", line 137, in apply_png_predictor
    raise ValueError(msg)
ValueError: Unsupported `bitspercomponent': 4
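
The value 4 in this message comes from the `DecodeParms` entry of the stream, not from the image dictionary itself, which declares `BitsPerComponent` 8 in the dump from the reproduction steps. To see which combination a given image carries before exporting it, the parameters can be read off the stream attributes. The helper below is a hypothetical sketch that works on a plain dict shaped like that dump; it does not resolve indirect PDF object references, and the fallback values 1 and 8 are the defaults the PDF specification gives for these two keys.

```python
def summarize_decode_parms(attrs):
    """Return a list of (predictor, bits_per_component) pairs.

    `attrs` is a plain dict shaped like the PDFStream attributes shown in
    this report. `DecodeParms` may be a single dict or a list of dicts.
    """
    parms = attrs.get("DecodeParms") or []
    if isinstance(parms, dict):
        parms = [parms]
    return [
        (p.get("Predictor", 1), p.get("BitsPerComponent", 8))
        for p in parms
        if isinstance(p, dict)  # skip null entries for filters without parameters
    ]


# Relevant keys from the stream in this report
attrs = {
    "BitsPerComponent": 8,
    "DecodeParms": [
        {"Columns": 2953, "Predictor": 15, "BitsPerComponent": 4, "Colors": 4}
    ],
}
print(summarize_decode_parms(attrs))  # [(15, 4)]
```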

manuGil, Apr 24 '23 11:04