Extract 4-bit images from PDF

Open • sheldonldev opened this issue 3 years ago • 1 comment

Feature request

I want to extract images from a PDF with pdfminer.six, but an error occurs when the PDF contains 4-bit images.

Here is the example:

from pdfminer.high_level import extract_pages
from pdfminer.image import ImageWriter
from pdfminer.layout import LTContainer, LTImage, LTPage

pages = list(extract_pages('document.pdf'))
page = pages[0]


def get_images(layout_object):
    # Yield every LTImage nested anywhere under layout_object,
    # not just the one found in the first child.
    if isinstance(layout_object, LTImage):
        yield layout_object
    elif isinstance(layout_object, LTContainer):
        for child in layout_object:
            yield from get_images(child)


def save_images_from_page(page: LTPage):
    # ImageWriter writes the exported files into 'output_dir'.
    iw = ImageWriter('output_dir')
    for image in get_images(page):
        iw.export_image(image)


save_images_from_page(page)

I've attached the sample PDF here: 2-5.pdf. Other attachments: pdfminer_image_error_code, pdfminer_image_error_detail.

Feb 03 '23 09:02 sheldonldev

PIL can handle 2- and 4-bit images, but only in modes L and P, and you have to pass the mode and bit depth separated by a semicolon as the raw mode argument, for example “L;4”.
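
For what it's worth, here is a minimal sketch of the grayscale case with Pillow. The helper name decode_gray4 is made up for illustration, and it assumes you already have the decoded sample bytes plus the width and height in hand (with pdfminer.six I would expect those to come from the LTImage's stream and srcsize, but I have not checked that against the attached file).

```python
from PIL import Image


def decode_gray4(data: bytes, width: int, height: int) -> Image.Image:
    """Decode packed 4-bit grayscale samples into an 8-bit "L" image.

    Hypothetical helper: `data` is assumed to be the already-decompressed
    sample data, with each row padded to a whole number of bytes.
    """
    # "raw" selects Pillow's raw decoder; the raw mode "L;4" tells it that
    # every byte holds two 4-bit samples, high nibble first.
    return Image.frombytes("L", (width, height), data, "raw", "L;4")


# Two bytes carry four samples: 0x0, 0xF, 0xF, 0x0.
img = decode_gray4(bytes([0x0F, 0xF0]), 4, 1)
print(img.mode, img.size)
```

The resulting image is an ordinary 8-bit mode L image, so it can be saved with img.save(...) like any other.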

I’m not in a position to open the raw PDF and look at the image dictionaries right now. If the image in question is 4-bit color, it is probably indexed, in which case you would need mode P and would also have to decode the palette stream.
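
To make the indexed case concrete, here is a rough sketch in plain Python of what decoding has to do: split each byte into two 4-bit palette indices, then look each index up in the palette. The function names are hypothetical, and it assumes the palette has already been decoded into flat R, G, B bytes (that is, the base color space is RGB), which may not hold for the attached file.

```python
def unpack_4bit_indices(data: bytes, width: int, height: int) -> list:
    """Split packed 4-bit samples (high nibble first) into one index per pixel."""
    stride = (width + 1) // 2  # each row is padded to a whole byte
    indices = []
    for row in range(height):
        line = data[row * stride:(row + 1) * stride]
        for col in range(width):
            byte = line[col // 2]
            # Even columns sit in the high nibble, odd columns in the low one.
            indices.append(byte >> 4 if col % 2 == 0 else byte & 0x0F)
    return indices


def apply_palette(indices: list, palette: bytes) -> bytes:
    """Map each index to its RGB triple; palette is assumed to be flat R, G, B bytes."""
    out = bytearray()
    for i in indices:
        out += palette[3 * i:3 * i + 3]
    return bytes(out)


# A 3x2 image: the last nibble of each row is padding.
packed = bytes([0x01, 0x20, 0x21, 0x00])
palette = bytes([0, 0, 0, 255, 0, 0, 0, 255, 0])  # black, red, green
indices = unpack_4bit_indices(packed, 3, 2)
rgb = apply_palette(indices, palette)
print(indices)  # [0, 1, 2, 2, 1, 0]
```

The rgb bytes are plain 8-bit RGB triples, so they can be handed to whatever imaging library you use from there.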

I am in the process of working through roughly the same thing (getting images from PDFs) and have found some tricks and limitations of both PIL and pdfminer.six.

Feb 11 '23 08:02 jcallaha