Need info on working with `page.images`

Open mratanusarkar opened this issue 1 year ago • 1 comments

The current `page.images` dump (two entries shown) looks like:

{'x0': 37.4602, 'y0': 180.816, 'x1': 53.6929, 'y1': 196.9833, 'width': 16.2327, 'height': 16.16730000000001, 'stream': <PDFStream(24254): raw=213, {'BitsPerComponent': 1, 'DecodeParms': {'Quality': 65}, 'Filter': /'JBIG2Decode', 'Height': 34, 'ImageMask': True, 'Intent': /'RelativeColorimetric', 'Length': 213, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 34}>, 'srcsize': (34, 34), 'imagemask': True, 'bits': 1, 'colorspace': [None], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 33, 'top': 647.7407000000001, 'bottom': 663.908, 'doctop': 27679.82469999998}
{'x0': 56.6807, 'y0': 471.272, 'x1': 317.3447, 'y1': 795.7760000000001, 'width': 260.664, 'height': 324.5040000000001, 'stream': <PDFStream(145): raw=47341, {'BitsPerComponent': 8, 'ColorSpace': <PDFObjRef:74050>, 'Filter': /'JPXDecode', 'Height': 676, 'Intent': /'RelativeColorimetric', 'Length': 47341, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 543}>, 'srcsize': (543, 676), 'imagemask': None, 'bits': 8, 'colorspace': [[/'Separation', /'Black', /'DeviceCMYK', {'C0': [0, 0, 0, 0], 'C1': [0, 0, 0, 1], 'Domain': [0, 1], 'FunctionType': 2, 'N': 1, 'Range': [0, 1, 0, 1, 0, 1, 0, 1]}]], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 33, 'top': 48.94799999999998, 'bottom': 373.45200000000006, 'doctop': 27081.03199999998}

I need help working with this and extracting the image data. I would like to export it to a PNG image or load it with Pillow. At this point, getting hold of the images in any format would work, and I can convert them as needed.

Could anyone help me get access to the image data from `page.images`? I am trying to extract and export all images, figures, and diagrams from each page of a PDF.

#1207 helps a bit, but I am running into errors and issues with that approach!

Some insight on this might even encourage me or someone else to write an image-handling class in pdfplumber.

Thanks!

Oct 19 '24 15:10 mratanusarkar

Hi @mratanusarkar — can you provide a minimal, runnable Python script and PDF that reproduces the errors you're encountering?

Nov 22 '24 01:11 jsvine
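
For anyone trying to build such a script, here is a minimal sketch of one possible workaround. It does not decode the embedded stream (the dump above shows `JBIG2Decode` and `JPXDecode` filters, which may not be decodable without extra tooling). Instead it crops the page to each image's bounding box and renders that region with `page.crop(...)` and `.to_image(...)`. The helper names `image_bbox`, `native_dpi`, and `export_page_images`, and the output file naming, are hypothetical and not part of pdfplumber. The sketch assumes each entry has the `x0`, `top`, `x1`, `bottom`, `width`, `height`, and `srcsize` keys seen in the dump.

```python
from pathlib import Path


def image_bbox(img, page_bbox):
    """Return (x0, top, x1, bottom) for one page.images entry, clamped to the page."""
    px0, ptop, px1, pbottom = page_bbox
    return (
        max(img["x0"], px0),
        max(img["top"], ptop),
        min(img["x1"], px1),
        min(img["bottom"], pbottom),
    )


def native_dpi(img):
    """Estimate the embedded image's resolution.

    'srcsize' is in pixels; 'width' and 'height' are in PDF points (1/72 inch).
    """
    src_w, src_h = img["srcsize"]
    return 72 * max(src_w / img["width"], src_h / img["height"])


def export_page_images(pdf_path, out_dir):
    """Render every image region to a PNG file and return the written paths."""
    import pdfplumber  # imported here so the helpers above work without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for i, img in enumerate(page.images):
                bbox = image_bbox(img, page.bbox)
                if bbox[2] <= bbox[0] or bbox[3] <= bbox[1]:
                    continue  # image lies entirely outside the page
                dest = out / f"page{page.page_number}_img{i}.png"
                # Render at roughly the image's own resolution to limit resampling.
                rendered = page.crop(bbox).to_image(resolution=native_dpi(img))
                rendered.save(str(dest))
                written.append(dest)
    return written
```

A call such as `export_page_images("example.pdf", "out")` (with a placeholder file name) would write one PNG per image. Because this renders the page region rather than extracting the stream, anything drawn on top of the image is included, and image masks appear as painted on the page. For both entries in the dump above, `native_dpi` works out to roughly 150.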