Losslessly optimize JPEGs and PNGs
If either of these features is already present, I apologize for the spam. I'm not fluent enough in reading code to tell whether OCRmyPDF already does either of these, or whether it goes as far in its attempts as the software I normally use for image optimization.
It is possible to losslessly reduce the size of already encoded JPEG images with mozjpeg, a Mozilla fork of libjpeg-turbo. Not sure about APIs and such, but once mozjpeg is installed, you can try it on JPEGs with the following command:
/opt/mozjpeg/bin/cjpeg -optimize test.jpg >test-optimized.jpg
It's not uncommon to get a size reduction of several percent. (This command won't work with the equivalent libjpeg-turbo binary, since it doesn't support JPEG input.) The issue is that most distro repositories don't provide mozjpeg, so support would need to be optional.
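To illustrate what "optional" could look like, here is a minimal Python sketch. It is not OCRmyPDF code: the function names `run_mozjpeg` and `keep_if_smaller` are hypothetical, and the binary path is simply the one from the command above. It skips the step when the binary is missing and keeps the result only if it is actually smaller. It compares sizes only and does not check that the pixels are unchanged.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Assumed install location, copied from the command above; adjust as needed.
MOZJPEG_CJPEG = Path("/opt/mozjpeg/bin/cjpeg")


def run_mozjpeg(src: Path, cjpeg: Path = MOZJPEG_CJPEG) -> Optional[bytes]:
    """Return the stdout of `cjpeg -optimize src`, or None if unavailable or failed."""
    if not cjpeg.is_file():
        return None  # mozjpeg is not installed, so the step is skipped
    try:
        result = subprocess.run(
            [str(cjpeg), "-optimize", str(src)],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout


def keep_if_smaller(original: bytes, candidate: Optional[bytes]) -> bytes:
    """Return the candidate only when it exists, is non-empty, and is strictly smaller."""
    if candidate and len(candidate) < len(original):
        return candidate
    return original
```

A caller would read the original file, pass the result of `run_mozjpeg` to `keep_if_smaller`, and write back whatever comes out, so a missing or failing mozjpeg never makes a file worse.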
Lossless PNG optimization can be done with optipng, which is more commonly found in distro repos. It has an option to change how much CPU time is devoted to optimization attempts. -o2 is the default, and it's really diminishing returns after that.
At least in the case of optipng, the optimization could be quite CPU intensive for PDFs with lots of large PNGs, though it'd also depend on the optimization level used. For PNGs, optimization time scales pretty much linearly with file size; PNGs that are several megabytes in size take a long time. Mozjpeg, on the other hand, is very fast in my experience, and optimizing even hundreds of files shouldn't be an issue.
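Since the CPU cost is the main concern, a sketch of how the call could be bounded might help. This is only an illustration, not OCRmyPDF code: `optipng_command` and `optimize_png` are hypothetical names, and the 60-second timeout is an arbitrary assumption. It uses optipng's `-o`, `-quiet` and `-out` options, and gives up quietly if the tool is missing, fails, or runs too long.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def optipng_command(src: Path, dst: Path, level: int = 2) -> List[str]:
    """Build the optipng command line; levels range from 0 (fastest) to 7 (slowest)."""
    if not 0 <= level <= 7:
        raise ValueError("optipng optimization level must be between 0 and 7")
    return ["optipng", f"-o{level}", "-quiet", "-out", str(dst), str(src)]


def optimize_png(src: Path, dst: Path, level: int = 2, timeout: float = 60.0) -> bool:
    """Write an optimized copy of src to dst; return True only if dst is smaller."""
    if shutil.which("optipng") is None:
        return False  # optional dependency is not installed
    try:
        subprocess.run(optipng_command(src, dst, level), check=True, timeout=timeout)
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False  # tool failed or exceeded the time budget (arbitrary, assumed)
    return dst.exists() and dst.stat().st_size < src.stat().st_size
```

The timeout is what keeps a PDF full of multi-megabyte PNGs from stalling: an image that takes too long is simply left as it was.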
We optimize JPEGs using Pillow's optimize=True, which does a little of what mozjpeg does. At least, it enables adaptive encoding. mozjpeg sounds more sophisticated. For monochrome images, we also do lossless JBIG2 optimization of CCITT or PNG images. For other PNGs, we use pngquant, which is lossy.
I don't think mozjpeg is lossless, at least not from the description of techniques they are using. It's a more efficient lossy encoder than typical jpeg with higher CPU usage. It sounds like they are doing mathematical optimization to find the optimal quantization table for a given quality level.
For mozjpeg, I'm a little reluctant to add another optional dependency that's not in Debian or Fedora. On a quick search, it seems like the main issue Debian had with mozjpeg is that it conflicts with other libjpeg* packages, and interest fizzled out. It also has a complicated license which doesn't bode well for packagers.
For optipng, I suppose we could do this and it would be easy, but it's not much of a gain over pngquant. In my experience it's fairly common for PNGs in PDFs to be low color count - the kind that quantizes very nicely.
I suppose this is at the stage of "would accept a pull request, won't do myself".
Sounds like a task for good old jpegoptim
@HansBull Do you know if jpegoptim has any advantages over Pillow with optimize=True?
Good question. Maybe not. One would have to test. Unfortunately I never entered the Python world ...
Optimizing with OxiPNG (in this example oxipng -o 5 -i 0 --strip safe) as an optional dependency would be great: it's fast and produces better results than pngquant:
$ ls -l *.png
-rw-r--r-- 1 user user 400K Jan 10 02:57 jb-010.png
-rw-r--r-- 1 user user 414K Jan 10 02:58 jb-010-fs8.png
-rw-r--r-- 1 user user 446K Jan 10 02:57 jb-011.png
-rw-r--r-- 1 user user 464K Jan 10 02:58 jb-011-fs8.png
-rw-r--r-- 1 user user 113K Jan 10 02:57 jb-012.png
-rw-r--r-- 1 user user 112K Jan 10 02:58 jb-012-fs8.png
-rw-r--r-- 1 user user 616K Jan 10 02:58 jb-013.png
-rw-r--r-- 1 user user 629K Jan 10 02:58 jb-013-fs8.png
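To put rough numbers on the listing above, here is a small Python sketch that pairs each file with its -fs8 counterpart and computes the difference. It assumes the -fs8 files are pngquant's output (that is pngquant's default output suffix) and that the unsuffixed files are the OxiPNG results; the listing itself does not say so. The sizes are rounded to whole kilobytes, so the percentages are approximate.

```python
from typing import Dict, List, Tuple

LISTING = """\
-rw-r--r-- 1 user user 400K Jan 10 02:57 jb-010.png
-rw-r--r-- 1 user user 414K Jan 10 02:58 jb-010-fs8.png
-rw-r--r-- 1 user user 446K Jan 10 02:57 jb-011.png
-rw-r--r-- 1 user user 464K Jan 10 02:58 jb-011-fs8.png
-rw-r--r-- 1 user user 113K Jan 10 02:57 jb-012.png
-rw-r--r-- 1 user user 112K Jan 10 02:58 jb-012-fs8.png
-rw-r--r-- 1 user user 616K Jan 10 02:58 jb-013.png
-rw-r--r-- 1 user user 629K Jan 10 02:58 jb-013-fs8.png
"""


def parse_sizes(listing: str) -> Dict[str, int]:
    """Map each file name to its size in kilobytes (fifth column, 'K' suffix)."""
    sizes = {}
    for line in listing.splitlines():
        fields = line.split()
        sizes[fields[-1]] = int(fields[4].rstrip("K"))
    return sizes


def compare(sizes: Dict[str, int], suffix: str = "-fs8.png") -> List[Tuple[str, int, int, float]]:
    """Return (name, size, fs8_size, percent smaller than the fs8 file) per pair."""
    rows = []
    for name, size in sizes.items():
        if name.endswith(suffix):
            continue
        other = sizes[name[: -len(".png")] + suffix]
        rows.append((name, size, other, 100 * (other - size) / other))
    return rows


if __name__ == "__main__":
    for name, size, other, pct in compare(parse_sizes(LISTING)):
        print(f"{name}: {size}K vs {other}K ({pct:+.1f}%)")
```

Under those assumptions, the unsuffixed files come out smaller in three of the four pairs and 1K larger for jb-012.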
@homocomputeris Thanks for the suggestion. Generally I wait for packages to be mature enough that they're accepted as a Debian/Ubuntu package. That also makes testing in CI easier. Oxipng isn't there yet but hopefully it will get there soon.