
need info working with `page.images`

Open mratanusarkar opened this issue 1 year ago • 1 comments

The current dump of the first two entries in `page.images` looks like:

{'x0': 37.4602, 'y0': 180.816, 'x1': 53.6929, 'y1': 196.9833, 'width': 16.2327, 'height': 16.16730000000001, 'stream': <PDFStream(24254): raw=213, {'BitsPerComponent': 1, 'DecodeParms': {'Quality': 65}, 'Filter': /'JBIG2Decode', 'Height': 34, 'ImageMask': True, 'Intent': /'RelativeColorimetric', 'Length': 213, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 34}>, 'srcsize': (34, 34), 'imagemask': True, 'bits': 1, 'colorspace': [None], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 33, 'top': 647.7407000000001, 'bottom': 663.908, 'doctop': 27679.82469999998}
{'x0': 56.6807, 'y0': 471.272, 'x1': 317.3447, 'y1': 795.7760000000001, 'width': 260.664, 'height': 324.5040000000001, 'stream': <PDFStream(145): raw=47341, {'BitsPerComponent': 8, 'ColorSpace': <PDFObjRef:74050>, 'Filter': /'JPXDecode', 'Height': 676, 'Intent': /'RelativeColorimetric', 'Length': 47341, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 543}>, 'srcsize': (543, 676), 'imagemask': None, 'bits': 8, 'colorspace': [[/'Separation', /'Black', /'DeviceCMYK', {'C0': [0, 0, 0, 0], 'C1': [0, 0, 0, 1], 'Domain': [0, 1], 'FunctionType': 2, 'N': 1, 'Range': [0, 1, 0, 1, 0, 1, 0, 1]}]], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 33, 'top': 48.94799999999998, 'bottom': 373.45200000000006, 'doctop': 27081.03199999998}
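For reading these dumps: `srcsize` is the pixel size of the embedded image, while `width` and `height` are its size on the page in PDF points. A rough sketch of how the effective resolution could be estimated from one entry, assuming the default PDF user unit of 1/72 inch (`effective_dpi` is a hypothetical helper, not part of pdfplumber):

```python
def effective_dpi(img):
    """Estimate (horizontal, vertical) DPI of one page.images entry.

    Assumes the default PDF user unit, where 1 point = 1/72 inch.
    """
    px_w, px_h = img["srcsize"]          # pixel dimensions of the source image
    dpi_x = px_w / img["width"] * 72     # pixels per inch across the placed width
    dpi_y = px_h / img["height"] * 72    # pixels per inch across the placed height
    return dpi_x, dpi_y


# Values copied from the second dump above (only the keys the helper needs).
second = {"srcsize": (543, 676), "width": 260.664, "height": 324.5040000000001}
dpi_x, dpi_y = effective_dpi(second)
print(f"{dpi_x:.1f} x {dpi_y:.1f} DPI")  # about 150 DPI in both directions
```

This suggests that rendering the page region at roughly 150 DPI would not lose detail relative to the embedded image.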

I need help working with this and extracting the image data. I would like to export it to a PNG image or use Pillow. At this point, getting hold of the images in any format would work, and I can convert and use them as desired.

Could anyone help me get access to the image data from page.images? I am trying to extract and export all images, figures, and diagrams from each page of a PDF.
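One possible workaround, sketched under assumptions: rather than decoding the embedded stream (the `JBIG2Decode` and `JPXDecode` filters in the dumps are not trivial to decode), crop the page to each image's bounding box and rasterize that region with `page.crop(...)` and `.to_image(...)`. This yields a rendering of the area, not the original image bytes. The helper names `image_bbox` and `export_page_images`, the output file naming, and the default resolution are all hypothetical choices for illustration.

```python
from pathlib import Path


def image_bbox(img, page_bbox):
    """Return (x0, top, x1, bottom) for a page.images entry, clamped to the page.

    Clamping matters because cropping to a box that extends past the page fails.
    """
    px0, ptop, px1, pbottom = page_bbox
    return (
        max(img["x0"], px0),
        max(img["top"], ptop),
        min(img["x1"], px1),
        min(img["bottom"], pbottom),
    )


def export_page_images(pdf_path, out_dir, resolution=150):
    """Render every image region of every page to a PNG and return the paths."""
    import pdfplumber  # imported lazily so image_bbox works without pdfplumber

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for i, img in enumerate(page.images):
                bbox = image_bbox(img, page.bbox)
                # Skip boxes that end up empty after clamping.
                if bbox[2] <= bbox[0] or bbox[3] <= bbox[1]:
                    continue
                target = out_dir / f"page{page.page_number}_img{i}.png"
                page.crop(bbox).to_image(resolution=resolution).save(str(target))
                saved.append(target)
    return saved
```

Usage would look like `export_page_images("input.pdf", "extracted_images")`, where both paths are placeholders. Because this rasterizes the page region, anything drawn on top of the image (text, vector overlays) is captured as well.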

#1207 helps a bit, but I am still running into errors with that approach!

Some insight on this might even encourage me or someone else to write an image-handling class for pdfplumber.

Thanks!

mratanusarkar, Oct 19 '24 15:10

Hi @mratanusarkar — can you provide a minimal, runnable Python script and PDF that reproduces the errors you're encountering?

jsvine, Nov 22 '24 01:11