pdfminer.six icon indicating copy to clipboard operation
pdfminer.six copied to clipboard

Extract 4-bits images from PDF

Open sheldonldev opened this issue 3 years ago • 1 comments

Feature request

I want to extract images from pdf with pdfminer.six, while error occurs when the pdf contains 4-bits images.

Here is the example:

import pdfminer
from pdfminer.image import ImageWriter
from pdfminer.high_level import extract_pages

pages = list(extract_pages('document.pdf'))
page = pages[0]


def get_image(layout_object):
    if isinstance(layout_object, pdfminer.layout.LTImage):
        return layout_object
    if isinstance(layout_object, pdfminer.layout.LTContainer):
        for child in layout_object:
            return get_image(child)
    else:
        return None


def save_images_from_page(page: pdfminer.layout.LTPage):
    images = list(filter(bool, map(get_image, page)))
    iw = ImageWriter('output_dir')
    for image in images:
        iw.export_image(image)


save_images_from_page(page)

I've attached the sample PDF here. 2-5.pdf pdfminer_image_error_code pdfminer_image_error_detail

sheldonldev avatar Feb 03 '23 09:02 sheldonldev

PIL can do 2 and 4 bit images but only for mode L and P and you have to pass the mode and bits separated by semicolon as the raw_mode parameter - for example “L;4”.

I’m not in a spot to view the raw PDF to look at the dictionary for the images - if the image in question is 4 bit color it is probably indexed in which case you would need mode P and you need to decode the palette stream as well.

I am in the process of working through roughly the same thing (getting images from PDFs) and have found some tricks and limitations of both PIL and pdfminer.six.

jcallaha avatar Feb 11 '23 08:02 jcallaha