pdfminer.six Raising wrong exception when selecting PNG predictor

Open manuGil opened this issue 1 year ago • 0 comments

The `decode` function in pdftypes.py raises an inaccurate exception when the predictor is not supported: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/pdftypes.py#L378-L389

The code suggests that if a predictor is higher than 10, an exception with the message "Unsupported predictor:" should be raised. Instead, the program bypasses this branch and raises a ValueError with the message "Unsupported `bitspercomponent'" from `apply_png_predictor`.
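To make the observed control flow easier to follow, here is a minimal, self-contained model of it. It is not the pdfminer.six source: `decode_model` and `check_bits_model` are hypothetical names, the exception class is a local stand-in, and the model assumes that any predictor of 10 or above is handed to the PNG helper, which only accepts 1 or 8 bits per component.

```python
class PDFNotImplementedError(Exception):
    """Local stand-in for pdfminer's exception of the same name."""


def check_bits_model(bitspercomponent):
    # Assumption: the PNG helper rejects anything other than 1 or 8 bits.
    if bitspercomponent not in (1, 8):
        raise ValueError("Unsupported `bitspercomponent': %d" % bitspercomponent)


def decode_model(params):
    """Simplified model of the predictor branch described in this issue."""
    pred = int(params.get("Predictor", 1))
    if pred == 1:
        return "no predictor"
    if pred >= 10:
        # Predictor 15 lands here, so the bit-depth check runs first.
        check_bits_model(int(params.get("BitsPerComponent", 8)))
        return "png predictor applied"
    # Only reached for predictors between 2 and 9 in this model.
    raise PDFNotImplementedError("Unsupported predictor: %r" % pred)


if __name__ == "__main__":
    try:
        decode_model({"Predictor": 15, "BitsPerComponent": 4})
    except ValueError as exc:
        print("ValueError:", exc)
```

Under these assumptions, the `DecodeParms` from this report (`Predictor` 15, `BitsPerComponent` 4) never reach the "Unsupported predictor" branch, which matches the traceback shown under Actual Output.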

Steps to reproduce:

Pass an LTImage object with at least the following stream and colorspace to `ImageWriter.export_image()`:

<PDFStream(13): raw=5846880, {'Subtype': /'Image', 'Length': 5846880, 'Filter': [/'FlateDecode'], 'SMask': <PDFObjRef:12>, 'BitsPerComponent': 8, 'ColorSpace': /'DeviceCMYK', 'Width': 2953, 'DecodeParms': [{'Columns': 2953, 'Predictor': 15, 'BitsPerComponent': 4, 'Colors': 4}], 'Height': 1205, 'Type': /'XObject', 'Decode': [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]}> 
# colorspace
[/'DeviceCMYK']
- This is also reproducible with `pdf2txt.py` when a PDF document contains an image element that uses an unsupported predictor.

Script in use when the bug was found. Note that the important part is writing an image element with the ImageWriter:

        pages = []
        for page in tqdm(pdf_pages, desc="Reading pages", unit="pages"):
            elements = sort_layout_elements(page, img_height=IMG_SETTINGS["width"], img_width=IMG_SETTINGS["height"])
            pages.append(elements)

        for page in tqdm(pages, desc="Extracting images", total=len(pages), unit="pages"):

            iw = ImageWriter(image_directory)
        
            for img in page["images"]:
            
                visual = Visual(document_page=page["page_number"], document=pdf_document, bbox=img.bbox)
                
                # Search for captions using proximity to image Bboxes
                # This might generate multiple matches
                bbox_matches =[]
                for _text in page["texts"]:
                    match = find_caption_by_bbox(img, _text, offset=CAP_SETTINGS["offset"], 
                                                direction=CAP_SETTINGS["direction"])
                    if match:
                        bbox_matches.append(match)
                # Search for captions using text analysis (keywords)
                # if more than one bbox matches are found
                if len(bbox_matches) == 0:
                    pass # don't set any caption
                elif len(bbox_matches) == 1:
                    caption = ""
                    for text_line in bbox_matches[0]:
                        caption += text_line.get_text().strip() 
                    visual.set_caption(caption)
                else: # more than one matches in bbox_matches
                    for _text in bbox_matches:
                        text_match = find_caption_by_text(_text, keywords=CAP_SETTINGS["keywords"])
                    if text_match:
                        caption = ""
                        for text_line in bbox_matches[0]:
                            caption += text_line.get_text().strip() 
                        # Set the caption to the first text match.
                        # All other matches will be ignored.
                        # This may introduce mistakes.
                        try:
                            visual.set_caption(caption)
                        except Warning: # ignore warnings when caption is already set.
                            pass
                        
                # rename image name to include page number
                img.name = "page" + str(page["page_number"]) + "-" + img.name
                # save image to file
            
                try:
                    print(img)
                    print(img.stream, img.colorspace)
                    image_file_name =iw.export_image(img) # returns image file name, 
                    # whose last part is generated automatically by pdfminer to guarantee uniqueness
                except ValueError as e:
                    print(f"Image has unsupported bit depth? Image:{img.name}")
                    raise e

Expected Output:

Traceback (most recent call last):
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 172, in <module>
    main(str_id)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 134, in main
    image_file_name =iw.export_image(img) # returns image file name, 
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 126, in export_image
    name = self._save_bytes(image)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 224, in _save_bytes
    channels = len(image.stream.get_data()) / width / height / (image.bits / 8)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 398, in get_data
    self.decode()
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 391, in decode
    raise PDFNotImplementedError(error_msg)
pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 15

Actual Output:

  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 132, in main
    image_file_name =iw.export_image(img) # returns image file name, 
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 126, in export_image
    name = self._save_bytes(image)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 224, in _save_bytes
    channels = len(image.stream.get_data()) / width / height / (image.bits / 8)
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 398, in get_data
    self.decode()
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 386, in decode
    data = apply_png_predictor(
  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/utils.py", line 137, in apply_png_predictor
    raise ValueError(msg)
ValueError: Unsupported `bitspercomponent': 4

Apr 24 '23 11:04 manuGil
