OCRmyPDF
Introduce a way to radically reduce the output file size (sacrificing image quality)
Is your feature request related to a problem? Please describe. My use case is "scanning" documents with a smartphone camera, then archiving those "scans" as low-quality monochrome images. But OCR should be done beforehand on the high-quality images.
I described this in more detail here: https://github.com/jbarlow83/OCRmyPDF/issues/443#issuecomment-618589203
Furthermore I see a discussion covering a similar topic here: #293
Describe the solution you'd like I want greater control over image quality for the images embedded into the PDF (after doing OCR). I can imagine these possible solutions (each point is a complete solution on its own):
- add a parameter that forces all images to be converted to 1 bpp images (low effort)
- add a parameter allowing arbitrary shell commands to be passed that will be executed by OCRmyPDF on the images in the temporary folder, before OCRmyPDF handles them further (high effort, security implications?)
- introduce multiple parameters that allow for more control of the things that go on in the optimization step (probably here) (medium effort?)
Additional context I'm currently evaluating how to achieve my goal with the least effort. I see two approaches:
- let OCRmyPDF do its thing on high-quality images/PDFs, then post-process manually with a pikepdf Python script that replaces the high-quality images in the PDF with low-quality ones (I have a working PoC, but it's not pretty)
- modify OCRmyPDF
I'm not sure about the second approach - where would be a good point to start? One approach could be:
- using PNG images in the input PDF file, then
- forcing pngquant to convert them to 1 bpp (here?)
- this could trigger PNG rewriting as G4 (here)
@jbarlow83 Does this sound right?
I would go with modifying ocrmypdf, and:
- Always input JPG
- Replace `pngquant.quantize` with code that always converts the image to 1 bpp (e.g. just use Pillow).
- You will actually want to install jbig2enc. JBIG2 outperforms G4 in size and is still widely supported. 1 bpp PNGs will always be converted to JBIG2 when a jbig2 encoder is available. You might even want JBIG2 in lossy mode, provided the dangers of lossy mode are acceptable to you (see the documentation and the "6-8" problem).
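The Pillow-based replacement could be as small as this sketch (function name and paths are hypothetical, not part of OCRmyPDF):

```python
# Minimal sketch of the suggested pngquant.quantize replacement: force
# every image down to 1 bpp with Pillow.
from PIL import Image


def to_1bpp(src_png, dst_png):
    im = Image.open(src_png)
    # Mode "1" is 1 bit per pixel; Pillow dithers by default. Pass
    # dither=Image.Dither.NONE to convert() for a hard threshold instead.
    im.convert("1").save(dst_png)
```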
Instead of forcing PNG input, you could also uncomment optimize.py:523, "try pngifying the jpegs", which, as the name suggests, speculatively converts JPEGs to PNGs. I believe this had a few corner cases to be worked out and is too costly in performance in the typical case, but you could try it, especially if you are forcing everything to JBIG2 anyway.
I'm giving it a try and am having some success.
@jbarlow83 A question: this `return` doesn't look right, since it leaves the function after handling only one image. Is this OK?
https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L426
For me this leads to only one of multiple images being handled in a multi-page PDF, where each page contains one image. (Since the loop cannot finish.)
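The failure mode is the classic return-inside-a-loop bug. A generic illustration (not the actual optimize.py code):

```python
# Returning inside the loop body exits after the first iteration, so
# only the first image is ever handled.
def handle_images_buggy(images):
    for image in images:
        return [image]  # bug: loop never reaches the second image


def handle_images_fixed(images):
    handled = []
    for image in images:
        handled.append(image)  # keep iterating over every image
    return handled
```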
And one (related?) curiosity: I managed to modify the conversion pipeline such that I now have multiple 1 bpp PNGs waiting in the temp folder to be handled. If there is only one such PNG the resulting PDF looks fine. If there are multiple such images the resulting PDF is distorted. Looking at the images in the temp folder I got:
- my quality-reduced PNGs
- the corresponding generated TIFs - each looking good
Then the code converts those TIFs to JBIG2 file(s) by invoking the jbig2 tool. This seems to be erroneous if there are multiple TIFs (leading to distortions in the final PDF); it works for one TIF, though. So the question is: do you have a test in place checking that PDFs with multiple 1 bpp images can be correctly converted to JBIG2? Or could this be a bug?
Note: I suspect that the above-mentioned `return` prevented multiple JBIG2 files from ever being inserted into the final PDF, since the loop always terminates after generating one TIF.
(But this might also be me not understanding how the final JBIG2 handling works. I might have broken something with my modifications.)
Edit: the debug output shows me this command line that is being used by OCRmyPDF:
```
DEBUG - Running: ['jbig2', '-b', 'group00000000', '-s', '-p', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000032.tif', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000028.tif', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000030.tif']
```
The TIF files look good.
I found the reason why my PDF containing the 1 bpp JBIG2 images was distorted: the color space of the embedded images was not correct. It was still `/DeviceRGB`, but it should be `/DeviceGray`.
I was able to quick-fix this by inserting `im_obj.ColorSpace = Name("/DeviceGray")` right before this line:
https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L340
The PDF now looks good.
Hypothesis: it was never intended to change the color space during image optimization?
Edit: Suggested fix:
```python
if Name.BitsPerComponent in im_obj and im_obj.BitsPerComponent == 1:
    log.debug("Setting ColorSpace to /DeviceGray")
    im_obj.ColorSpace = Name("/DeviceGray")
```
Edit 2: A better fix? Add `im_obj.ColorSpace = Name("/DeviceGray")` here:
https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L430
I implemented and pushed a solution that works for me and is basically a shortcut to TIF generation (see above linked commit). I added a new user-script option that can be used to run arbitrary shell commands on images. This user script takes the source and destination file paths as input parameters and must convert the source image to a 1 bpp TIF.
The shell script that works for me looks like this:
```sh
#!/bin/sh
convert -colorspace gray -fill white -sharpen 0x2 "$1" - | jpegtopnm | pamthreshold | pamtotiff -g4 > "$2"
```
This requires ImageMagick and netpbm-progs to be installed, but one could use other conversion tools here as well. `pamthreshold` implements a nice dynamic threshold.
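For reference, roughly the same pipeline can be expressed in pure Pillow. This is a sketch, not a drop-in replacement: pamthreshold's adaptive thresholding is simplified here to a fixed cutoff, and the function name and threshold value are made up:

```python
# Rough Pillow equivalent of the shell pipeline above: grayscale,
# sharpen, threshold to 1 bpp, save as Group 4 (G4) TIFF.
from PIL import Image, ImageFilter


def jpg_to_1bpp_tif(src, dst, threshold=160):
    im = Image.open(src).convert("L")        # to grayscale
    im = im.filter(ImageFilter.SHARPEN)      # mild sharpening
    # Fixed threshold; pamthreshold computes a local/dynamic one instead.
    bw = im.point(lambda v: 255 if v > threshold else 0).convert("1")
    bw.save(dst, format="TIFF", compression="group4")
```

Group 4 writing requires a Pillow build with libtiff support, which is the norm for binary wheels.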
The command that I used to test looks like this:
```
ocrmypdf --user-script-jpg-to-1bpp-tif shell.sh --jbig2-lossy -v 1 -O3 in.pdf out.pdf
```
- in.pdf: 791 KB (created from three colored JPGs using `img2pdf`)
- out.pdf: 72 KB
- Optimize ratio: 11.19, savings: 91.1%
I'm not opening a pull request since the solution is very specific to my use case, and right now it only handles JPEG images. But maybe somebody finds this useful as a starting point.
> I suspect that above mentioned return prevented multiple JBIG2 files from ever being inserted into the final PDF - since the loop always terminates after generating one TIF.
You are correct, those `return`s are wrong and will suppress multiple images per file. That's a great catch.
> Hypothesis: it was never intended to change the color space during image optimization?
Also correct. /DeviceGray is not correct in general, but probably suitable for your use case. Some files will specify a complex colorspace instead of /DeviceRGB and changing to /DeviceGray may not be correct, so optimize tries to avoid changing colorspace. It is also possible to specify a 1-bit color colorspace, e.g. 0 is blue and 1 is red.
> I'm not opening a pull request since the solution
Agreed - that's a lot of new dependencies to add.
I also needed exactly this!
I tried to rebase onto master, missed some things in the required manual merges and added them afterwards, so my branch doesn't look so clean right now. But here it is: https://github.com/andersjohansson/OCRmyPDF/tree/feature/github-541-reduce-output-file-size-v10
It works fine now though! Thanks!
userscript.py could be structured as a plugin instead (a new feature in 10.x). You'd need to create a new hook as well by adding it to `pluginspec.py`, and then we could have a generic, pluggable interface for people who want to optimize images more aggressively.
If @heinrich-ulbricht or anyone else is interested in looking more into this in the future, see also the comments that @jbarlow83 added here: https://github.com/andersjohansson/OCRmyPDF/commit/4e5b68f1b966312edeba8ef3b6e12037bac8aef6
What about using MRC compression to keep the file visually as close as possible to the original while losing a lot of size, as @jbarlow83 mentioned here:
https://github.com/jbarlow83/OCRmyPDF/issues/836#issuecomment-922560147
(We do not do page color segmentation at this time, i.e., finding regions of a page or image that can be represented with a reduced colorspace. It's not an easy feature to implement and will probably need a corporate sponsor so that I can work on it full time for a few weeks. You do get better compression if you're able to work with the original PDFs.)
You could just look at how the closed-source DjVuSolo 3.1 achieves astonishingly small sizes with really legible results, even keeping color in its JBIG2-like JB2 format. With DjVuToy you can transform those DjVus into PDFs that are only about twice as big.
With https://github.com/jwilk/didjvu there has been an attempt to open-source this MRC mechanism, however with some inconveniences that keep files too big to be a serious candidate to replace the old DjVuSolo 3.1 in the Russian user group.
However, many DjVu patents have expired, so there might be some valuable MRC knowledge in those patents, as @jsbien suggested.
https://github.com/jwilk/didjvu/issues/18
@rmast This is interesting information and could be helpful if I ever get the opportunity to implement this. Thanks.
(Found this through @rmast) -- If you're looking for an MRC implementation, https://github.com/internetarchive/archive-pdf-tools does this when it creates PDFs with text layers (it's mostly like OCRmyPDF, but doesn't attempt to do OCR and requires that to be done externally). The MRC code can also be used as a library, although I probably need to make the API a bit more ... accessible. @jbarlow83 - If you're interested in this I could try to make a more accessible API. Alternatively, I could look at improving the "PDF recoding" method, where the software compresses an existing PDF by replacing the images with MRC-compressed images, so one could just run recode_pdf after OCRmyPDF has done its thing.
@MerlijnWajer Thanks for the suggestion - that is impressive work. Unfortunately it's license-incompatible (AGPL) and also uses PyMuPDF as its PDF generator. I like PyMuPDF and used it previously, but it relies on libmupdf which is only released as a static library and doesn't promise a stable interface, meaning that Linux distributions won't include it.
But setting it up through a plugin interface, calling recode_pdf by command line, would certainly be doable.
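Wiring that up as a plugin-style post-processing step could look roughly like this. Note this is speculative: the exact recode_pdf CLI flags and any OCRmyPDF hook name are assumptions here, not verified against either project:

```python
# Hypothetical sketch of calling recode_pdf as an external post-processing
# step via the command line (flags below are placeholders, NOT the
# verified recode_pdf interface).
import subprocess


def build_recode_cmd(input_pdf, output_pdf):
    # Placeholder arguments; consult recode_pdf --help for the real ones.
    return ["recode_pdf", "--from-pdf", str(input_pdf), "-o", str(output_pdf)]


def recode(input_pdf, output_pdf):
    """Run the external recoder, raising if it exits non-zero."""
    subprocess.run(build_recode_cmd(input_pdf, output_pdf), check=True)
```

Keeping the integration at the subprocess boundary also sidesteps the AGPL/MPL2 license incompatibility mentioned above, since no code is linked.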
I'll try to implement this mode (modifying the images of a PDF without touching most other parts) in the next week or so and report back, then we could maybe look at the plugin path. (Actually, give me more like two weeks, I'll have to do some refactoring to support this recoding mode)
It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license we could also work in it that way.
> It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license we could also work in it that way.
Right - I'll have to think about that (and also ask). For now I will try to get a tool to recode an existing PDF working first, since I've been wanting to add/implement that for a long time anyway, and this is a great motivation to do it. I'll also make the MRC API more usable (current code is heavily optimised for performance, not for API usability), though, so we could revisit the potential license situation once that is done.
@blaueente @v217 I saw your input in these issues concerning introducing MRC into OCRMyPDF: https://github.com/ocrmypdf/OCRmyPDF/issues/9 https://github.com/fritz-hh/OCRmyPDF/issues/88
I understand license-(in)compatibility is inhibiting progress.
I was also looking into didjvu to understand the MRC compression over there. MRC is achieved by that tool via a Gamera didjvu binarizer, followed by C44 from the djvulibre tooling for both the foreground and background, so the license of didjvu is probably less important than the licenses of Gamera and C44.
Do you have experience with getting products with those incompatible licenses alive? Would the same question be different when trying to get GScan2PDF (GPLv3) use MRC?
> @blaueente @v217 I saw your input in these issues concerning introducing MRC into OCRMyPDF: #9 fritz-hh/OCRmyPDF#88
> I understand license-(in)compatibility is inhibiting progress.
> I was also looking into didjvu for understanding the MRC-compression overthere. MRC is reached by that tool by a Gamera didjvu-binarizer, followed by C44 of the djvulibre tooling for both the fore and background, so the license of didjvu is probably less important than the licenses of Gamera and C44.
Didjvu itself mainly deals with organizing everything, so I guess one couldn't use code from it directly anyway. C44/IW44 is the wavelet codec used by didjvu, and is therefore unusable for PDF MRC. The ideas of archive-pdf-tools seem pretty good to me; maybe they could learn from Gamera's separation algorithms and the ROI-style coding of IW44, although I see good discussions on their GitHub page.
> Do you have experience with getting products with those incompatible licenses alive? Would the same question be different when trying to get GScan2PDF (GPLv3) use MRC?
Regarding licenses, I can't really help you. The approach of @MerlijnWajer sounds great though. Talk about what can be shared, and what can be just re-used as separate interfacing binaries.
I was experimenting with a script a while ago but couldn't get it to fully work on oddball PDFs and then gave up for a bit. But I think I just realised that at least for PDFs generated by OCRmyPDF, this is a non-issue. Does anyone have some sample/test PDFs created by OCRMyPDF that I could run my script on?
OK, I installed it on a Debian machine and ran a few tests. It seems to work, at least for my basic testing (see the attached files: input image, ocrmypdf output for that input image, MRC-compressed PDF).
The text layer and document metadata seem untouched, and the pdfimages output seems sensible:
```
$ pdfimages -list /tmp/ocrmypdf.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2472 3484 rgb 3 8 jpeg no 12 0 762 762 635K 2.5%
$ pdfimages -list /tmp/out.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2472 3484 rgb 3 8 jpx no 16 0 762 762 12.8K 0.1%
1 1 image 2472 3484 rgb 3 8 jpx no 17 0 762 762 62.6K 0.2%
1 2 smask 2472 3484 gray 1 1 jbig2 no 17 0 762 762 41.9K 4.0%
```
Sorry for the delay, but it looks like this is workable, so I could clean up the code and we can do some more testing?
VeraPDF also doesn't seem to complain:
```
$ ~/verapdf/verapdf --format text --flavour 2b /tmp/out.pdf
PASS /tmp/out.pdf
```
Here is my compression script from a few months back; it's very much a work in progress, so please don't use it for any production purposes (but of course, please test and report back):
https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py (apologies for the mess, it is a -test- script)
The only argument is the input PDF; it will save the compressed PDF to `/tmp/out.pdf`. You will need `archive-pdf-tools==1.4.13` installed (available via pip). Depending on which code is commented out, it can compress JPEG2000 using Pillow, JPEG using jpegoptim, or JPEG2000 using kakadu.
If this test code/script seems to do the job, I can extend it to also support conversion to bitonal ccitt/jbig2 (as mentioned in #906) given a flag or something and tidy it up.
As stated earlier, complex PDFs with many images and transparency don't work well yet, but for that I'd have to look at the transformations of the pages, the images, transparency, etc... which I don't think is an issue for OCRmyPDF compression use cases?
One thing that I'd like to add is extracting the text layer from a PDF to hOCR, so that it can be used as input for the script and the script knows where the text areas are. This is actually not far off at all; I already have some local code for it, so depending on the feedback here I could try to integrate that.
I tried your script on a newly arrived ABN AMRO letter of two pages. The resulting out.pdf is 129 KB, and the letters ABN AMRO at the top are quite vague. DjVuSolo 3.1/DjVuToy reach 46 KB with sharper ABN AMRO letters and less fuzz around the pricing table.
I had to compile Leptonica 1.72, as the Leptonica 1.68 suggested by jbig2enc didn't compile right with libpng-dev. I used an Ubuntu 20 image on Azure:
```sh
sudo apt-get update
sudo apt-get install automake git libtool libpng-dev build-essential make ocrmypdf pip
pip install archive-pdf-tools==1.4.13
vi ~/.bashrc
export PATH=$PATH:/home/rmast/.local/bin
git clone https://github.com/DanBloomberg/leptonica.git
git clone https://github.com/agl/jbig2enc.git
wget https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py
cd leptonica/
git checkout v1.72
chmod +x configure
./configure
make
sudo make install
cd ../jbig2enc/
./autogen.sh
./configure
make
sudo make install
```
Right, the current code is also inferior to what the normal tooling does since that uses the text layer info as well, but once I add that (I will try to do that soon), it could be better.
DjVu is a fun comparison but it has the advantage of being able to use image formats that are not supported in PDF.
> DjVu is a fun comparison but it has the advantage of being able to use image formats that are not supported in PDF.
That's where DjVuToy comes in: it converts the DjVu result of DjVuSolo 3.1 to a JBIG2/JPEG2000 PDF of 46 KB. The DjVu itself is only 31 KB.
I can't find the source for that program. Is it free software? (If not: maybe another issue/place would be better to discuss that?)
No, both are closed source. DjVuSolo 3.1 is a very old pre-commercial demo of the capabilities of DjVu. When they commercialized DjVu they set such high prices that DjVu priced itself out of the market. I guess the Internet Archive once used DjVu. DjVuToy is actively maintained by a Chinese enthusiast, but he's not planning on opening the source.
Here is the result via DjVuSolo 3.1/DjVuToy 3.06 unicode edition, half the size of your result from the Covid health form:
```
rmast@Ubuntu20:~$ pdfimages -list in.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 824 1162 rgb 3 8 jpx yes 1 0 100 101 3080B 0.1%
1 1 stencil 2472 3484 - 1 1 jbig2 no 3 0 300 300 17.6K 1.7%
1 2 stencil 2472 3484 - 1 1 jbig2 no 4 0 300 300 347B 0.0%
1 3 stencil 2472 3484 - 1 1 jbig2 no 5 0 300 300 68B 0.0%
1 4 stencil 2472 3484 - 1 1 jbig2 no 6 0 300 300 2137B 0.2%
1 5 stencil 2472 3484 - 1 1 jbig2 no 7 0 300 300 618B 0.1%
1 6 stencil 2472 3484 - 1 1 jbig2 no 8 0 300 300 984B 0.1%
1 7 stencil 2472 3484 - 1 1 jbig2 no 9 0 300 300 357B 0.0%
1 8 stencil 2472 3484 - 1 1 jbig2 no 10 0 300 300 6063B 0.6%
1 9 stencil 2472 3484 - 1 1 jbig2 no 11 0 300 300 324B 0.0%
1 10 stencil 2472 3484 - 1 1 jbig2 no 12 0 300 300 11.2K 1.1%
1 11 stencil 2472 3484 - 1 1 jbig2 no 13 0 300 300 125B 0.0%
1 12 stencil 2472 3484 - 1 1 jbig2 no 14 0 300 300 114B 0.0%
1 13 stencil 2472 3484 - 1 1 jbig2 no 15 0 300 300 322B 0.0%
1 14 stencil 2472 3484 - 1 1 jbig2 no 16 0 300 300 129B 0.0%
1 15 stencil 2472 3484 - 1 1 jbig2 no 17 0 300 300 246B 0.0%
1 16 stencil 2472 3484 - 1 1 jbig2 no 18 0 300 300 210B 0.0%
1 17 stencil 2472 3484 - 1 1 jbig2 no 19 0 300 300 335B 0.0%
1 18 stencil 2472 3484 - 1 1 jbig2 no 20 0 300 300 194B 0.0%
1 19 stencil 2472 3484 - 1 1 jbig2 no 21 0 300 300 74B 0.0%
1 20 stencil 2472 3484 - 1 1 jbig2 no 22 0 300 300 170B 0.0%
1 21 stencil 2472 3484 - 1 1 jbig2 no 23 0 300 300 349B 0.0%
1 22 stencil 2472 3484 - 1 1 jbig2 no 24 0 300 300 325B 0.0%
1 23 stencil 2472 3484 - 1 1 jbig2 no 25 0 300 300 109B 0.0%
1 24 stencil 2472 3484 - 1 1 jbig2 no 26 0 300 300 139B 0.0%
1 25 stencil 2472 3484 - 1 1 jbig2 no 27 0 300 300 271B 0.0%
1 26 stencil 2472 3484 - 1 1 jbig2 no 28 0 300 300 913B 0.1%
1 27 stencil 2472 3484 - 1 1 jbig2 no 29 0 300 300 138B 0.0%
1 28 stencil 2472 3484 - 1 1 jbig2 no 30 0 300 300 113B 0.0%
1 29 stencil 2472 3484 - 1 1 jbig2 no 31 0 300 300 116B 0.0%
1 30 stencil 2472 3484 - 1 1 jbig2 no 32 0 300 300 117B 0.0%
1 31 stencil 2472 3484 - 1 1 jbig2 no 33 0 300 300 401B 0.0%
1 32 stencil 2472 3484 - 1 1 jbig2 no 34 0 300 300 202B 0.0%
rmast@Ubuntu20:~$ ls -al in.pdf
-rw-rw-r-- 1 rmast rmast 58988 May 5 18:00 in.pdf
```
The many JBIG2 pictures stem from all the colors in the JB2 picture; DjVuToy translates those into separate images, each with its own color.
Especially take a look at the clearness of the background picture...
So I've cleaned up the code a bit and am looking for some people to try and run it on their OCRMyPDF results. (Let's not focus on DjVu stuff here please, as I'm trying to make a tool that people can use based on existing/working code)
You'll need this build of archive-pdf-tools: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2477636215 (just click on the artifact download link and pick the release for your OS/Python interpreter from the artifact.zip)
And then download this script: https://archive.org/~merlijn/pdfcomp.py
Use like so:
```
$ python pdfcomp.py /tmp/ocrmypdf.pdf /tmp/ocrmypdf_comp.pdf
Compression factor: 5.193651663405088
```
Some random notes...
- The tool will also extract the text from the PDF and use that to aid in the compression. It has seen light testing and I don't recommend running it on PDFs that weren't made by OCRmyPDF (and perhaps just Tesseract) at this point
- The text layer should be fully intact -- in fact, everything else should be too.
- The tool will also leave all your PDF metadata alone, so the creator metadata will still be OCRmyPDF / Tesseract:
```
$ grep -a Tess /tmp/ocrmypdf_comp.pdf
/Creator (ocrmypdf 6.1.2 / Tesseract OCR-PDF 4.1.3)
<xmp:CreatorTool>ocrmypdf 6.1.2 / Tesseract OCR-PDF 4.1.3</xmp:CreatorTool></rdf:Description>
```