Optimize images with SMask

Open benbro opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe.
Some tools create PDFs with huge images. optimize.py currently doesn't optimize images that have an SMask. Is there a way to improve this?

Describe the solution you'd like
Downsample/optimize images that have an SMask.

Describe alternatives you've considered
Ghostscript can optimize and downsample an image with an SMask:
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -sOutputFile=optimized.pdf test.pdf

Example file
This PDF was created with an old version of PowerPoint and has an image with an SMask: test.pdf

benbro avatar Feb 24 '23 10:02 benbro

For the file in question, the image is inside a Form XObject which we don't currently attempt to optimize, so it's excluded for that reason. The SMask is already compressed with Flate to a high ratio and likely can't be improved.

For implementing SMask in general: we cannot use pngquant on SMask, because pngquant tries to find a palette and PDF doesn't accept palettes on SMask. That's the main reason they are skipped.

In the rare case of lossy SMask, we should probably leave them alone because there's a risk of compression artifacts messing up images in novel ways.

We could also apply Flate to any SMask that is not compressed. (We can do that by saving the PDF with recompress_flate in pikepdf rather than going through the normal image optimization, just by changing get_pdf_save_settings, although the savings achieved won't be logged as usual.) That's fairly easy to do.
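A minimal sketch of the Flate idea, assuming only that pikepdf's `Pdf.save` accepts the `compress_streams` and `recompress_flate` keyword arguments. The helper names `flate_save_settings` and `save_with_flate` are hypothetical, and this is not how OCRmyPDF's own `get_pdf_save_settings` is written.

```python
def flate_save_settings(recompress: bool = True) -> dict:
    """Build keyword arguments for pikepdf.Pdf.save (hypothetical helper)."""
    return {
        # Compress streams that currently have no compression filter.
        "compress_streams": True,
        # Decode and re-encode streams that already use Flate.
        "recompress_flate": recompress,
    }


def save_with_flate(src: str, dst: str) -> None:
    """Rewrite src to dst using the settings above (illustrative only)."""
    # pikepdf is a third-party package; it is imported lazily so the
    # settings helper above can be used without it.
    import pikepdf

    with pikepdf.open(src) as pdf:
        pdf.save(dst, **flate_save_settings())
```

Saving this way would not report per-image savings, since it bypasses the image optimizer entirely.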

We could potentially export the image+SMask as RGBA to pngquant, and then split the <8-bit RGBA palette image into a <8-bit palette image and 8-bit SMask. This gets complicated and needs to be checked in both the premultiplied (SMask with Matte set) and straight cases, but might lead to bigger savings on both images combined.
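To make the straight versus premultiplied distinction concrete, here is a small pure-Python sketch of the split step only. It assumes 8-bit channels and straight (non-premultiplied) RGBA input, and uses the preblending relation from the PDF specification, c' = m + α(c − m), when a matte color is supplied. The function name `split_rgba` is hypothetical, and palette handling and pngquant itself are left out.

```python
def split_rgba(pixels, matte=None):
    """Split straight RGBA pixels into a color list and an alpha list.

    pixels: iterable of (r, g, b, a) tuples with 8-bit channels.
    matte:  optional (r, g, b) matte color. When given, color samples are
            preblended as c' = m + alpha * (c - m), which is the form
            expected when the SMask carries a /Matte entry.
    """
    color, alpha = [], []
    for r, g, b, a in pixels:
        if matte is not None:
            f = a / 255  # alpha as a fraction in [0, 1]
            r, g, b = (round(m + f * (c - m)) for c, m in zip((r, g, b), matte))
        color.append((r, g, b))
        alpha.append(a)
    return color, alpha
```

In the straight case the color samples pass through unchanged; in the matte case fully transparent pixels collapse to the matte color, which is one reason both cases would need separate checking.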

jbarlow83 avatar Feb 24 '23 23:02 jbarlow83

I'm asking about an option to downsample the SMask because in the attached PDF it is large for no apparent reason. If I'm not mistaken, a large SMask results in high memory usage and longer rendering time in PDF viewers. I'm less interested in reducing the PDF size in this specific case.

benbro avatar Feb 25 '23 00:02 benbro

Yes, an SMask that large will take excessive resources for rendering and will end up downsampled by the PDF viewer anyway. We'll have to see about opt-in downsampling in a future release.

jbarlow83 avatar Feb 26 '23 23:02 jbarlow83