pdfminer.six
Raising wrong exception when selecting PNG predictor
The `decode` function in `pdftypes.py` raises an inaccurate exception when the predictor is not supported:
https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/pdftypes.py#L378-L389
The code suggests that if a predictor is higher than 10, an exception with the message "Unsupported predictor:" should be raised. However, the program bypasses this and instead raises a `ValueError` with the message "Unsupported bitspercomponent" in `apply_png_predictor`.
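To make the two code paths easier to compare, here is a minimal, self-contained sketch of the dispatch as I read it from the two tracebacks. It is not pdfminer's code: `decode_with_predictor`, `apply_png_predictor_stub` and `NotImplementedPredictorError` are hypothetical stand-ins, the branch layout (predictors of 10 and above going to the PNG helper) is an assumption, and the stub rejecting every bit depth other than 8 is also an assumption, since the report only shows that 4 is rejected.

```python
class NotImplementedPredictorError(Exception):
    """Hypothetical stand-in for pdfminer's PDFNotImplementedError."""


def apply_png_predictor_stub(pred, colors, columns, bitspercomponent, data):
    # Stand-in for the PNG helper: only a bit-depth check is modelled here,
    # no PNG un-filtering is performed. Rejecting everything except 8 bits
    # is an assumption made for this sketch.
    if bitspercomponent != 8:
        raise ValueError("Unsupported `bitspercomponent': %d" % bitspercomponent)
    return data


def decode_with_predictor(params, data):
    # Assumed branch layout: 1 means "no predictor", 10 and above are
    # delegated to the PNG helper, anything else is reported as unsupported.
    pred = int(params.get("Predictor", 1))
    if pred == 1:
        return data
    if pred >= 10:
        return apply_png_predictor_stub(
            pred,
            int(params.get("Colors", 1)),
            int(params.get("Columns", 1)),
            int(params.get("BitsPerComponent", 8)),
            data,
        )
    raise NotImplementedPredictorError("Unsupported predictor: %r" % pred)
```

Under these assumptions, `Predictor: 15` with `BitsPerComponent: 4` enters the PNG branch, so the `ValueError` from the helper surfaces before the "Unsupported predictor" branch can be reached.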
Steps to reproduce:
- Pass an `LTImage` object with at least the following stream and colorspace to `ImageWriter.export_image()`:
<PDFStream(13): raw=5846880, {'Subtype': /'Image', 'Length': 5846880, 'Filter': [/'FlateDecode'], 'SMask': <PDFObjRef:12>, 'BitsPerComponent': 8, 'ColorSpace': /'DeviceCMYK', 'Width': 2953, 'DecodeParms': [{'Columns': 2953, 'Predictor': 15, 'BitsPerComponent': 4, 'Colors': 4}], 'Height': 1205, 'Type': /'XObject', 'Decode': [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]}>
# colorspace
[/'DeviceCMYK']
- This is also reproducible when using `pdf2txt.py` on a PDF document that contains an image element with an unsupported predictor.
- Script in use when the bug was found. Note that the important part is writing an image element using the `ImageWriter`:
pages = []
for page in tqdm(pdf_pages, desc="Reading pages", unit="pages"):
    elements = sort_layout_elements(page, img_height=IMG_SETTINGS["width"], img_width=IMG_SETTINGS["height"])
    pages.append(elements)
for page in tqdm(pages, desc="Extracting images", total=len(pages), unit="pages"):
    iw = ImageWriter(image_directory)
    for img in page["images"]:
        visual = Visual(document_page=page["page_number"], document=pdf_document, bbox=img.bbox)
        # Search for captions using proximity to image Bboxes
        # This might generate multiple matches
        bbox_matches = []
        for _text in page["texts"]:
            match = find_caption_by_bbox(img, _text, offset=CAP_SETTINGS["offset"],
                                         direction=CAP_SETTINGS["direction"])
            if match:
                bbox_matches.append(match)
        # Search for captions using text analysis (keywords)
        # if more than one bbox matches are found
        if len(bbox_matches) == 0:
            pass  # don't set any caption
        elif len(bbox_matches) == 1:
            caption = ""
            for text_line in bbox_matches[0]:
                caption += text_line.get_text().strip()
            visual.set_caption(caption)
        else:  # more than one matches in bbox_matches
            for _text in bbox_matches:
                text_match = find_caption_by_text(_text, keywords=CAP_SETTINGS["keywords"])
                if text_match:
                    caption = ""
                    for text_line in bbox_matches[0]:
                        caption += text_line.get_text().strip()
                    # Set the caption to the first text match.
                    # All other matches will be ignored.
                    # This may introduce mistakes
                    try:
                        visual.set_caption(caption)
                    except Warning:  # ignore warnings when caption is already set.
                        pass
        # rename image name to include page number
        img.name = "page" + str(page["page_number"]) + "-" + img.name
        # save image to file
        try:
            print(img)
            print(img.stream, img.colorspace)
            image_file_name =iw.export_image(img)  # returns image file name,
            # which last part is automatically generated by pdfminer to guarantee uniqueness
        except ValueError as e:
            print(f"Image has unsupported bit depth? Image:{img.name}")
            raise e
Expected Output:
Traceback (most recent call last):
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 172, in <module>
main(str_id)
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 134, in main
image_file_name =iw.export_image(img) # returns image file name,
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 126, in export_image
name = self._save_bytes(image)
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 224, in _save_bytes
channels = len(image.stream.get_data()) / width / height / (image.bits / 8)
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 398, in get_data
self.decode()
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 391, in decode
raise PDFNotImplementedError(error_msg)
pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 15
Actual Output:
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 132, in main
image_file_name =iw.export_image(img) # returns image file name,
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 126, in export_image
name = self._save_bytes(image)
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/image.py", line 224, in _save_bytes
channels = len(image.stream.get_data()) / width / height / (image.bits / 8)
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 398, in get_data
self.decode()
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/pdftypes.py", line 386, in decode
data = apply_png_predictor(
File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/venv2/lib/python3.10/site-packages/pdfminer/utils.py", line 137, in apply_png_predictor
raise ValueError(msg)
ValueError: Unsupported `bitspercomponent': 4
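Because the exception type differs between the expected and actual output, a caller that wants to keep processing the remaining images currently has to guard against both. Below is a small caller-side sketch of that idea. `export_or_skip` and the `skipped` list are hypothetical names introduced for illustration, not part of pdfminer.six; the default `errors` tuple only covers `ValueError`, and `pdfminer.pdftypes.PDFNotImplementedError` (the class named in the expected traceback) would have to be added by the caller.

```python
def export_or_skip(export, image, skipped, errors=(ValueError,)):
    """Call export(image); on one of `errors`, record the failure and return None.

    `export` is any callable taking the image (for example a bound
    ImageWriter.export_image); `skipped` collects (image name, message) pairs.
    """
    try:
        return export(image)
    except errors as exc:
        # Keep a record instead of aborting the whole run.
        skipped.append((getattr(image, "name", "<unnamed>"), str(exc)))
        return None


# Possible use inside the extraction loop (not executed here):
#   from pdfminer.pdftypes import PDFNotImplementedError
#   skipped = []
#   image_file_name = export_or_skip(
#       iw.export_image, img, skipped,
#       errors=(ValueError, PDFNotImplementedError),
#   )
```

This only works around the symptom in the calling script; it does not change which exception `decode` raises.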