pdfrw
Support stream extraction
This would be useful for image extraction, and perhaps for other purposes too.
Given that pdfrw already supports finding the image objects (using the find_objects function), would it be possible to also support extracting the streams that those image objects refer to?
I'm not sure exactly what you are asking, but the pdfrw stream parser may certainly be used by user code. See, for example, examples/rl2/decodegraphics.py for an incomplete example that works for a few PDFs.
Thanks. I am not sure I understand what the example is supposed to do. Is it parsing some primitive draw operations (from a stream containing them) and converting those to corresponding reportlab operations, or what?
I am looking for a way to extract PNG or JPG images from a PDF. From what I have understood, they are just embedded as byte streams in the PDF.
I have now realized that find_objects returns objects of type PdfDict, which do have a stream attribute. Is this what, in the case of images, contains the (encoded) image byte data that can be decoded to get the embedded PNG or JPG?
The images embedded in my PDF file seem to use something called /FlateDecode. From googling, it seems the stream is something that can be passed to `gzip.decompress` to produce the final image file?
Here's a print of the said PdfDict object that refers to embedded image:
{'/Length': '37873', '/Type': '/XObject', '/Subtype': '/Image', '/Width': '601', '/Height': '103', '/Interpolate': 'true', '/ColorSpace': ['/ICCBased', {'/Length': '2615', '/N': '3', '/Alternate': '/DeviceRGB', '/Filter': '/FlateDecode'}], '/Intent': '/Perceptual', '/BitsPerComponent': '8', '/Filter': '/FlateDecode'}
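For readers landing here with the same question: /FlateDecode data is in zlib (deflate) format, so the matching standard-library call would be zlib.decompress rather than gzip.decompress. The following is only a minimal sketch under stated assumptions: inflate_stream is a hypothetical helper and not part of pdfrw, the payload is fabricated from dummy samples rather than read from a real PDF, and no /DecodeParms predictor is involved. Whether pdfrw hands the stream over as str or bytes should be checked against the installed version; the Latin-1 handling below is an assumption.

```python
import zlib


def inflate_stream(stream):
    """Decompress a /FlateDecode stream payload (zlib format, not gzip)."""
    # Assumption: a str payload holds one byte per character (Latin-1).
    if isinstance(stream, str):
        stream = stream.encode("latin-1")
    return zlib.decompress(stream)


# Stand-in for the image printed above: 601 x 103 pixels, 3 components
# (/N 3 in the ICCBased colour space), 8 bits per component.
width, height, components = 601, 103, 3
raw = bytes(width * height * components)  # dummy all-black samples
payload = zlib.compress(raw)              # what a PDF writer would store

samples = inflate_stream(payload)
# For this kind of image the result is raw pixel samples, not a PNG file.
print(len(samples) == width * height * components)
```

The decompressed bytes are bare samples in row order, which is why they cannot be saved directly as a .png file.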
There is ongoing work in pdfrw to support stream decompression. It works in some cases, but not in others.
The data in a stream may be analogous to the data in an image file, but it is not normally an actual compressed image file. For example, an image file will have a header that will not appear in the data stream.
There is a Python tool called img2pdf that will place images into PDF files. Examining that might give you a clue about the best way to extract images from PDF files.
But note that pdfrw already contains all the code that is necessary to extract the data, with the possible exception of performing the right kind of decompression. What it doesn't have, in addition to comprehensive decompression support, is the knowledge of how to build a container for an image file format, such as a jpeg or png file.
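To illustrate the "container" point, here is a rough sketch of wrapping already-decompressed samples in a PNG file using only the standard library. It assumes 8-bit RGB samples with no row padding, which matches /BitsPerComponent 8 with three components; other colour spaces, bit depths, or predictor settings would need different handling. The names png_chunk and rgb8_to_png are hypothetical helpers, not pdfrw functions, and the 2 x 2 red image is made-up test data.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def rgb8_to_png(samples, width, height):
    """Wrap raw 8-bit RGB samples (row-major, no padding) in a PNG container."""
    stride = width * 3
    if len(samples) != stride * height:
        raise ValueError("sample count does not match width * height * 3")
    # PNG expects a filter-type byte before every scanline; 0 means "none".
    rows = b"".join(
        b"\x00" + samples[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (RGB),
    # then compression, filter and interlace methods, all 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", zlib.compress(rows))
        + png_chunk(b"IEND", b"")
    )


# Made-up data: a 2 x 2 image of pure red pixels.
png = rgb8_to_png(bytes([255, 0, 0] * 4), 2, 2)
```

Writing the result out would then be a matter of open("image.png", "wb").write(png). This sketch ignores the ICC profile attached to the image, so colours may differ slightly from what a PDF viewer shows.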
Very interesting. I'm looking for a way to extract, recompress and re-insert images.
It would be cool to compress all images inside the PDF with e.g. Google Guetzli and re-insert them.
Will see if the comp/decomp part is available from other libs.
@m3nu - have a look at PyMuPDF. It is, however, based on the C library MuPDF, i.e. not pure Python, and it is licensed under GNU GPL 3.0 and GNU AGPL 3.0. For Windows, binaries are available for all Python versions. On other platforms (Mac and all Unix flavors are supported), you must generate MuPDF before you can install PyMuPDF. Other than that, it sounds like you can get what you want from it.
That looks great. Thanks for the pointer @JorjMcKie
Have there been any developments on this front? I'm trying to extract images, edit them with OpenCV, and then insert them back into the PDF.
Alternatively, are there any libraries that can do this?
I'm not aware of any other libs and didn't manage to replace images on the fly.
My use-case was to recompress images. I ended up using ghostscript to recompress. It processes all images in a PDF.