Extract 4-bits images from PDF
Feature request
I want to extract images from pdf with pdfminer.six, while error occurs when the pdf contains 4-bits images.
Here is the example:
import pdfminer
from pdfminer.image import ImageWriter
from pdfminer.high_level import extract_pages
pages = list(extract_pages('document.pdf'))
page = pages[0]
def get_image(layout_object):
if isinstance(layout_object, pdfminer.layout.LTImage):
return layout_object
if isinstance(layout_object, pdfminer.layout.LTContainer):
for child in layout_object:
return get_image(child)
else:
return None
def save_images_from_page(page: pdfminer.layout.LTPage):
images = list(filter(bool, map(get_image, page)))
iw = ImageWriter('output_dir')
for image in images:
iw.export_image(image)
save_images_from_page(page)
I've attached the sample PDF here.
2-5.pdf

PIL can do 2 and 4 bit images but only for mode L and P and you have to pass the mode and bits separated by semicolon as the raw_mode parameter - for example “L;4”.
I’m not in a spot to view the raw PDF to look at the dictionary for the images - if the image in question is 4 bit color it is probably indexed in which case you would need mode P and you need to decode the palette stream as well.
I am in the process of working through roughly the same thing (getting images from PDFs) and have found some tricks and limitations of both PIL and pdfminer.six.