Introduce a way to radically reduce the output file size (sacrificing image quality)

Open heinrich-ulbricht opened this issue 4 years ago • 75 comments

Is your feature request related to a problem? Please describe.

My use case is "scanning" documents with a smartphone camera, then archiving those "scans" as low-quality monochrome images. But OCR should be done beforehand on the high-quality images.

I describe this in more detail here: https://github.com/jbarlow83/OCRmyPDF/issues/443#issuecomment-618589203

Furthermore I see a discussion covering a similar topic here: #293

Describe the solution you'd like

I want greater control over the quality of the images embedded into the PDF (after doing OCR). I can imagine these possible solutions (each point is a complete solution):

  • add a parameter that forces all images to be converted to 1 bpp images (low effort)
  • add a parameter allowing arbitrary shell commands to be passed that will be executed by OCRmyPDF on the images in the temporary folder, before OCRmyPDF handles them further (high effort, security implications?)
  • introduce multiple parameters that allow for more control of the things that go on in the optimization step (probably here) (medium effort?)

Additional context

I'm currently evaluating how to achieve my goal with the least effort. I see two approaches:

  1. let OCRmyPDF do its thing on high quality images/PDFs, then post-process with a Python script that uses pikepdf to replace the high quality images in the PDF with low quality ones (I have a working PoC, but it's not pretty)
  2. modify OCRmyPDF

I'm not sure about the second approach - where would be a good point to start? One approach could be:

  1. using PNG images in the input PDF file, then
  2. forcing pngquant to convert them to 1 bpp (here?)
  3. this could trigger PNG rewriting as G4 (here)

@jbarlow83 Does this sound right?

heinrich-ulbricht avatar Apr 24 '20 09:04 heinrich-ulbricht

I would go with modifying ocrmypdf, and:

  1. Always input JPG
  2. Replace pngquant.quantize with code that always converts the image to 1bpp (e.g. just use Pillow).
  3. You will actually want to install jbig2enc. JBIG2 outperforms G4 in size and is still widely supported. 1bpp PNGs will always be converted to JBIG2 when a jbig2 encoder is available. You might even want JBIG2 in lossy mode, provided the dangers of lossy mode are acceptable to you (see documentation and the "6-8" problem).

Instead of forcing PNG input, you could also uncomment the optimize.py:523 "try pngifying the jpegs" which as the name suggests, speculatively converts JPEGs to PNGs. I believe this had a few corner cases to be worked out and is too costly in performance in the typical case, but you could try that, especially if you are forcing everything to JBIG2 anyway.
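
As an editorial illustration of why 1 bpp saves so much space, the sketch below thresholds a row-major list of 8-bit grayscale values and packs eight pixels into each byte. It uses only the standard library; the function name and the fixed threshold of 128 are assumptions for illustration, not OCRmyPDF code. In practice Pillow's Image.convert("1") does this job (with dithering by default).

```python
def pack_1bpp(gray, width, height, threshold=128):
    """Pack 8-bit grayscale pixels (row-major list) into 1 bpp rows.

    Each row is padded to a whole number of bytes; a set bit means white.
    """
    if len(gray) != width * height:
        raise ValueError("pixel count does not match width * height")
    row_bytes = (width + 7) // 8
    out = bytearray(row_bytes * height)
    for y in range(height):
        for x in range(width):
            if gray[y * width + x] >= threshold:
                # Most significant bit is the leftmost pixel of the byte.
                out[y * row_bytes + x // 8] |= 0x80 >> (x % 8)
    return bytes(out)


# A 10x2 image: first row all white, second row all black.
pixels = [255] * 10 + [0] * 10
packed = pack_1bpp(pixels, 10, 2)
print(len(pixels), "bytes at 8 bpp ->", len(packed), "bytes at 1 bpp")
```

The packed rows are what a bitonal codec such as G4 or JBIG2 then compresses further.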

jbarlow83 avatar Apr 24 '20 09:04 jbarlow83

I'm giving it a try and am having some success.

@jbarlow83 A question: This return doesn't look right since it leaves the function after handling only one image. Is this ok?

https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L426

For me this leads to only one of multiple images being handled in a multi-page PDF, where each page contains one image. (Since the loop cannot finish.)

And one (related?) curiosity: I managed to modify the conversion pipeline such that I now have multiple 1 bpp PNGs waiting in the temp folder to be handled. If there is only one such PNG the resulting PDF looks fine. If there are multiple such images the resulting PDF is distorted. Looking at the images in the temp folder I got:

  • my quality-reduced PNGs
  • the corresponding generated TIFs - each looking good

Then the code converts those TIFs to JBIG2 file(s) by invoking the jbig2 tool. This seems to be erroneous if there are multiple TIFs (leading to distortions in the final PDF). It works for one TIF though. So the question is: do you have a test in place checking that PDFs with multiple 1 bpp images can correctly be converted to the JBIG2 format? Or could this be a bug?

Note: I suspect that above mentioned return prevented multiple JBIG2 files from ever being inserted into the final PDF - since the loop always terminates after generating one TIF.

(But this might also be me not understanding how the final JBIG2 handling works. I might have broken something with my modifications.)

Edit: the debug output shows me this command line that is being used by OCRmyPDF:

  DEBUG - Running: ['jbig2', '-b', 'group00000000', '-s', '-p', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000032.tif', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000028.tif', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000030.tif']

The TIF files look good.
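
For anyone who wants to reproduce that invocation outside OCRmyPDF, here is a minimal sketch that rebuilds the same argument list. The helper name and its keyword arguments are hypothetical; only the flags that appear in the debug line above (-b, -s, -p) are used, and the file names are shortened.

```python
import shlex


def build_jbig2_command(basename, tif_paths, symbol_mode=True, pdf_mode=True):
    """Return an argv list shaped like the one in the debug log above."""
    cmd = ["jbig2", "-b", basename]
    if symbol_mode:
        cmd.append("-s")   # symbol coding shared across the listed pages
    if pdf_mode:
        cmd.append("-p")   # emit data meant for embedding in a PDF
    cmd.extend(str(p) for p in tif_paths)
    return cmd


cmd = build_jbig2_command(
    "group00000000",
    ["images/00000032.tif", "images/00000028.tif", "images/00000030.tif"],
)
print(shlex.join(cmd))
# To run it for real (requires jbig2enc on PATH):
#   subprocess.run(cmd, cwd=work_dir, check=True)
```

Running the printed command by hand on the temp folder is a quick way to check whether the encoder or the later PDF insertion step causes the distortion.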

heinrich-ulbricht avatar Apr 24 '20 13:04 heinrich-ulbricht

I found the reason why my PDF containing the 1 bpp JBIG2 images was distorted. The color space of the embedded images was not correct. It was still /DeviceRGB, but the correct value would be /DeviceGray.

I was able to quick-fix this by inserting im_obj.ColorSpace = Name("/DeviceGray") right before this line: https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L340 The PDF now looks good.

Hypothesis: it was never intended to change the color space during image optimization?

Edit: Suggested fix:

if (Name.BitsPerComponent in im_obj and im_obj.BitsPerComponent == 1):
  log.debug("Setting ColorSpace to /DeviceGray")
  im_obj.ColorSpace = Name("/DeviceGray")

Edit2: Better fix? Add im_obj.ColorSpace = Name("/DeviceGray") here: https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L430

heinrich-ulbricht avatar Apr 24 '20 19:04 heinrich-ulbricht

I implemented and pushed a solution that works for me and is basically a shortcut to TIF generation (see above linked commit). I added a new user script option that can be used to run arbitrary shell commands on images. This user script takes the source and destination file paths as input parameters and must convert the source image to a 1 bpp TIF.

The shell script that works for me looks like this:

#!/bin/sh
convert -colorspace gray -fill white -sharpen 0x2 "$1" - | jpegtopnm | pamthreshold | pamtotiff -g4 > "$2"

This requires ImageMagick and netpbm-progs to be installed. But one could use other conversion tools here as well. pamthreshold implements a nice dynamic threshold.
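
For readers without netpbm, here is a sketch of automatic threshold selection using Otsu's method, a common global technique. This is a swapped-in, simpler algorithm and not a description of what pamthreshold does internally; the function name is made up for illustration.

```python
def otsu_threshold(gray):
    """Return the threshold t (0-255) maximizing between-class variance.

    Pixels with value < t are classed as dark (ink), the rest as light (paper).
    """
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    total = len(gray)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels below the candidate threshold
    sum_dark = 0    # sum of their values
    for t in range(1, 256):
        w_dark += hist[t - 1]
        sum_dark += (t - 1) * hist[t - 1]
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Dark ink around 40, light paper around 200.
sample = [38, 40, 42] * 10 + [198, 200, 202] * 30
t = otsu_threshold(sample)
bilevel = [0 if v < t else 1 for v in sample]
print(t, bilevel.count(0), bilevel.count(1))
```

A single global threshold like this struggles with uneven lighting in phone photos, which is where locally adaptive thresholding tends to do better.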

The command that I used to test looks like this:

ocrmypdf --user-script-jpg-to-1bpp-tif shell.sh --jbig2-lossy -v 1 -O3 in.pdf out.pdf

  • in.pdf: 791 KB (created from three colored JPGs using img2pdf)
  • out.pdf: 72 KB
  • Optimize ratio: 11.19 savings: 91.1%
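
The ratio and savings figures relate to each other as sketched below (the function name is hypothetical). With the rounded kilobyte sizes from the list this gives roughly 10.99 and 90.9%, close to but not identical to the logged 11.19 and 91.1%, so the log was evidently computed from sizes other than these rounded figures.

```python
def optimize_stats(input_size, output_size):
    """Return (ratio, savings_percent) for two file sizes in the same unit."""
    if input_size <= 0 or output_size <= 0:
        raise ValueError("sizes must be positive")
    ratio = input_size / output_size
    savings = (1 - output_size / input_size) * 100
    return ratio, savings


ratio, savings = optimize_stats(791, 72)   # rounded KB figures from the list above
print(f"ratio {ratio:.2f}, savings {savings:.1f}%")
```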

I'm not opening a pull request since the solution is very narrow to my use case. And right now it only handles JPEG images. But maybe somebody finds this useful as a starting point.

heinrich-ulbricht avatar Apr 24 '20 22:04 heinrich-ulbricht

> I suspect that above mentioned return prevented multiple JBIG2 files from ever being inserted into the final PDF - since the loop always terminates after generating one TIF.

You are correct, those returns are wrong and will suppress multiple images per file. That's a great catch.
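
The shape of that bug, reduced to a toy example: a `return` inside the loop body ends the whole function after the first image, whereas `continue` only skips the current one. This is a simplified stand-in with made-up names, not the actual optimize.py code.

```python
def convert_first_only(images):
    """Buggy shape: `return` inside the loop exits after the first match."""
    converted = []
    for name, bpp in images:
        if bpp == 1:
            converted.append(name + ".tif")
            return converted      # leaves the function; later images are skipped
    return converted


def convert_all(images):
    """Fixed shape: keep looping so every 1 bpp image is handled."""
    converted = []
    for name, bpp in images:
        if bpp != 1:
            continue              # skip this image, but keep going
        converted.append(name + ".tif")
    return converted


pages = [("00000028", 1), ("00000030", 1), ("00000032", 1)]
print(convert_first_only(pages))
print(convert_all(pages))
```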

> Hypothesis: it was never intended to change the color space during image optimization?

Also correct. /DeviceGray is not correct in general, but probably suitable for your use case. Some files will specify a complex colorspace instead of /DeviceRGB and changing to /DeviceGray may not be correct, so optimize tries to avoid changing colorspace. It is also possible to specify a 1-bit color colorspace, e.g. 0 is blue and 1 is red.
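
One way to act on that caveat is to overwrite the colorspace only when the existing one is a plain device space. The sketch below uses ordinary dicts as stand-ins for PDF image dictionaries (it does not use pikepdf), and the function name and the set of "simple" spaces are assumptions for illustration.

```python
SIMPLE_SPACES = {"/DeviceRGB", "/DeviceGray"}


def can_force_device_gray(image):
    """Decide whether overwriting ColorSpace with /DeviceGray looks safe.

    `image` is a plain dict standing in for a PDF image dictionary.
    """
    if image.get("BitsPerComponent") != 1:
        return False
    colorspace = image.get("ColorSpace")
    # Indexed, ICCBased and similar spaces are arrays in PDF; only bare
    # device names pass this check.
    return isinstance(colorspace, str) and colorspace in SIMPLE_SPACES


plain = {"BitsPerComponent": 1, "ColorSpace": "/DeviceRGB"}
# A 1-bit palette where 0 is blue and 1 is red, as in the example above.
two_color = {
    "BitsPerComponent": 1,
    "ColorSpace": ["/Indexed", "/DeviceRGB", 1, b"\x00\x00\xff\xff\x00\x00"],
}
print(can_force_device_gray(plain), can_force_device_gray(two_color))
```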

> I'm not opening a pull request since the solution

Agreed - that's a lot of new dependencies to add.

jbarlow83 avatar Apr 25 '20 09:04 jbarlow83

I also needed exactly this!

I tried to rebase onto master, missed some things in the manual merges required and added them afterwards, so my branch doesn’t look so clean right now. But here it is: https://github.com/andersjohansson/OCRmyPDF/tree/feature/github-541-reduce-output-file-size-v10

It works fine now though! Thanks!

andersjohansson avatar Jun 25 '20 14:06 andersjohansson

userscript.py could be structured as a plugin instead (new feature for 10.x). You'd need to create a new hook as well by adding it to pluginspec.py, and then we could have a generic, pluggable interface for people who want to optimize images more aggressively.

jbarlow83 avatar Jun 25 '20 22:06 jbarlow83

If @heinrich-ulbricht or anyone else is interested in looking more into this in the future, see also the comments that @jbarlow83 added here: https://github.com/andersjohansson/OCRmyPDF/commit/4e5b68f1b966312edeba8ef3b6e12037bac8aef6

andersjohansson avatar Jun 26 '20 08:06 andersjohansson

What about using MRC compression to keep the file visually as close to the original as possible while losing a lot of size, as @jbarlow83 mentioned here:

https://github.com/jbarlow83/OCRmyPDF/issues/836#issuecomment-922560147

(We do not do page color segmentation at this time, i.e., finding regions of a page or image that can be represented with a reduced colorspace. It's not an easy feature to implement and will probably need a corporate sponsor so that I can work on it full time for a few weeks. You do get better compression if you're able to work with the original PDFs.)

You could just look at how the closed-source DjVuSolo 3.1 reaches astonishingly small sizes with really legible results, even keeping color in the JBIG2-like JB2 format. With DjVuToy you can transform those DjVu files into PDFs that are only about twice as big.

With https://github.com/jwilk/didjvu there has been an attempt to open source this MRC-mechanism, however with some inconveniences that keep files too big to be a serious candidate to replace the old DjVuSolo 3.1 in the Russian user group.

However many DjVu-patents have expired, so there might be some valuable MRC-knowledge in those patents, as @jsbien suggested.

https://github.com/jwilk/didjvu/issues/18
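
For readers unfamiliar with MRC (mixed raster content): a page is split into a 1-bit mask plus separate foreground and background layers, so that each layer can use a codec suited to it. The toy sketch below works on a list of grayscale pixels; the fixed threshold and the fill strategy are assumptions for illustration and far cruder than what didjvu or archive-pdf-tools do.

```python
def mrc_split(gray, threshold=128):
    """Split pixels into (mask, foreground, background) layers."""
    mask = [1 if v < threshold else 0 for v in gray]      # 1 = text/ink
    ink = [v for v, m in zip(gray, mask) if m]
    paper = [v for v, m in zip(gray, mask) if not m]
    ink_level = round(sum(ink) / len(ink)) if ink else 0
    paper_level = round(sum(paper) / len(paper)) if paper else 255
    foreground = [ink_level] * len(gray)                  # flat, compresses well
    # Fill masked pixels with the paper average so the background stays smooth.
    background = [paper_level if m else v for v, m in zip(gray, mask)]
    return mask, foreground, background


def mrc_compose(mask, foreground, background):
    """Rebuild the page: foreground where the mask is set, else background."""
    return [f if m else b for m, f, b in zip(mask, foreground, background)]


page = [250, 248, 30, 34, 252, 250]
mask, fg, bg = mrc_split(page)
print(mask, mrc_compose(mask, fg, bg))
```

The sharp edges live in the bitonal mask, so the two color layers can be compressed heavily (or downsampled) without blurring the text.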

rmast avatar Nov 07 '21 20:11 rmast

@rmast This is interesting information and could be helpful if I ever get the opportunity to implement this. Thanks.

jbarlow83 avatar Nov 08 '21 07:11 jbarlow83

(Found this through @rmast) -- If you're looking for an MRC implementation, https://github.com/internetarchive/archive-pdf-tools does this when it creates PDFs with text layers (it's mostly like OCRmyPDF but doesn't attempt to do OCR and requires that be done externally) - the MRC code can also be used as a library, although I probably need to make the API a bit more ... accessible. @jbarlow83 - If you're interested in this I could try to make a more accessible API. Alternatively, I could look at improving the "pdf recoding" method somewhat, where the software compresses an existing PDF by replacing the images with MRC-compressed images, so then one could just run recode_pdf after OCRmyPDF has done its thing.

MerlijnWajer avatar Nov 24 '21 19:11 MerlijnWajer

@MerlijnWajer Thanks for the suggestion - that is impressive work. Unfortunately it's license-incompatible (AGPL) and also uses PyMuPDF as its PDF generator. I like PyMuPDF and used it previously, but it relies on libmupdf which is only released as a static library and doesn't promise a stable interface, meaning that Linux distributions won't include it.

But setting it up through a plugin interface, calling recode_pdf by command line, would certainly be doable.

jbarlow83 avatar Nov 24 '21 22:11 jbarlow83

I'll try to implement this mode (modifying the images of a PDF without touching most other parts) in the next week or so and report back, then we could maybe look at the plugin path. (Actually, give me more like two weeks, I'll have to do some refactoring to support this recoding mode)

MerlijnWajer avatar Nov 25 '21 12:11 MerlijnWajer

It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license we could also work in it that way.

jbarlow83 avatar Nov 25 '21 22:11 jbarlow83

> It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license we could also work in it that way.

Right - I'll have to think about that (and also ask). For now I will try to get a tool to recode an existing PDF working first, since I've been wanting to add/implement that for a long time anyway, and this is a great motivation to do it. I'll also make the MRC API more usable (current code is heavily optimised for performance, not for API usability), though, so we could revisit the potential license situation once that is done.

MerlijnWajer avatar Nov 25 '21 22:11 MerlijnWajer

@blaueente @v217 I saw your input in these issues concerning introducing MRC into OCRMyPDF: https://github.com/ocrmypdf/OCRmyPDF/issues/9 https://github.com/fritz-hh/OCRmyPDF/issues/88

I understand license-(in)compatibility is inhibiting progress.

I was also looking into didjvu to understand the MRC compression over there. That tool achieves MRC with a Gamera didjvu binarizer, followed by C44 from the djvulibre tooling for both the foreground and the background, so the license of didjvu is probably less important than the licenses of Gamera and C44.

Do you have experience with getting products with those incompatible licenses alive? Would the same question be different when trying to get GScan2PDF (GPLv3) use MRC?

rmast avatar Jan 01 '22 23:01 rmast

> @blaueente @v217 I saw your input in these issues concerning introducing MRC into OCRMyPDF: #9 fritz-hh/OCRmyPDF#88
>
> I understand license-(in)compatibility is inhibiting progress.
>
> I was also looking into didjvu for understanding the MRC-compression over there. MRC is reached by that tool by a Gamera didjvu-binarizer, followed by C44 of the djvulibre tooling for both the fore and background, so the license of didjvu is probably less important than the licenses of Gamera and C44.

Didjvu itself mainly deals with organizing everything, so I guess one couldn't use code from it directly anyways. C44 / iw44 is the wavelet codec used by didjvu, and therefore unusable for PDF MRCs. The ideas of archive-pdf-tools seem pretty good to me, maybe they could learn from gamera's separation algorithms, and the ROI-style coding of iw44, although I see good discussions in their github page.

> Do you have experience with getting products with those incompatible licenses alive? Would the same question be different when trying to get GScan2PDF (GPLv3) use MRC?

Regarding licenses, I can't really help you. The approach of @MerlijnWajer sounds great though. Talk about what can be shared, and what can be just re-used as separate interfacing binaries.

blaueente avatar Jan 04 '22 20:01 blaueente

I was experimenting with a script a while ago but couldn't get it to fully work on oddball PDFs and then gave up for a bit. But I think I just realised that at least for PDFs generated by OCRmyPDF, this is a non-issue. Does anyone have some sample/test PDFs created by OCRMyPDF that I could run my script on?

MerlijnWajer avatar May 03 '22 23:05 MerlijnWajer

OK, I installed it on a debian machine and ran a few tests. It seems to work at least for my basic testing (see attached files, input image, ocrmypdf output given input image, MRC compressed pdf)

example.tar.gz

The text layer and document metadata seem untouched, and pdfimages output seems sensible:

$ pdfimages -list /tmp/ocrmypdf.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2472  3484  rgb     3   8  jpeg   no        12  0   762   762  635K 2.5%

$ pdfimages -list /tmp/out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2472  3484  rgb     3   8  jpx    no        16  0   762   762 12.8K 0.1%
   1     1 image    2472  3484  rgb     3   8  jpx    no        17  0   762   762 62.6K 0.2%
   1     2 smask    2472  3484  gray    1   1  jbig2  no        17  0   762   762 41.9K 4.0%

Sorry for the delay, but it looks like this is workable, so I could clean up the code and we can do some more testing?
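
To compare the two listings numerically, the size column can be summed with a few lines of Python. This is a rough sketch with hypothetical helper names; it assumes the size is always the second-to-last field and that the K/M/G suffixes are 1024-based (the ratio does not depend on that choice).

```python
UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '635K' or '3080B' to bytes (1024-based assumed)."""
    return float(text[:-1]) * UNITS[text[-1]]


def total_image_bytes(listing):
    """Sum the 'size' column of `pdfimages -list` output."""
    total = 0.0
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; header and rule lines do not.
        if len(fields) >= 2 and fields[0].isdigit():
            total += parse_size(fields[-2])
    return total


before = """
   1     0 image    2472  3484  rgb     3   8  jpeg   no        12  0   762   762  635K 2.5%
"""
after = """
   1     0 image    2472  3484  rgb     3   8  jpx    no        16  0   762   762 12.8K 0.1%
   1     1 image    2472  3484  rgb     3   8  jpx    no        17  0   762   762 62.6K 0.2%
   1     2 smask    2472  3484  gray    1   1  jbig2  no        17  0   762   762 41.9K 4.0%
"""
image_ratio = total_image_bytes(before) / total_image_bytes(after)
print(f"image data shrank by a factor of about {image_ratio:.1f}")
```

For the listings above, the three MRC layers add up to about 117K against 635K for the original JPEG.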

MerlijnWajer avatar May 03 '22 23:05 MerlijnWajer

VeraPDF also doesn't seem to complain:

~/verapdf/verapdf --format text --flavour 2b /tmp/out.pdf
PASS /tmp/out.pdf

MerlijnWajer avatar May 03 '22 23:05 MerlijnWajer

Here is my compression script from a few months back, it's very much work in progress so please don't use it for any production purposes (but of course, please test and report back):

https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py (apologies for the mess, it is a -test- script)

The only argument is the input pdf, and then it will save the compressed PDF to /tmp/out.pdf. You will need archive-pdf-tools==1.4.13 installed (available via pip). Depending on which code is commented it can compress JPEG2000 using Pillow, JPEG using jpegoptim, or JPEG2000 using kakadu.

If this test code/script seems to do the job, I can extend it to also support conversion to bitonal ccitt/jbig2 (as mentioned in #906) given a flag or something and tidy it up.

As stated earlier, complex PDFs with many images and transparency don't work well yet, but for that I'd have to look at the transformations of the pages, the images, transparency, etc... which I don't think is an issue for OCRmyPDF compression use cases?

MerlijnWajer avatar May 03 '22 23:05 MerlijnWajer

One thing that I'd like to add is to extract the text layer from a PDF to hOCR, so that it can be used as input for the script, so that it knows where the text areas are. This is actually not far off at all, I already have some local code for it, so depending on the feedback here I could try to integrate that.

MerlijnWajer avatar May 03 '22 23:05 MerlijnWajer

I tried your script on a newly arrived ABN AMRO-letter of two pages. The resulting out.pdf is 129 kb, and the letters ABN AMRO on top are quite vague. DjvuSolo 3.1/DjVuToy reach 46 kb with sharper ABN AMRO letters and less fuzz around the pricing table.

I had to compile Leptonica 1.72, as the leptonica 1.68 suggested by jbig2enc didn't compile right with libpng-dev. I used an Ubuntu 20 image on Azure:

sudo apt-get update
sudo apt-get install automake git libtool libpng-dev build-essential make ocrmypdf pip
pip install archive-pdf-tools==1.4.13
vi ~/.bashrc
export PATH=$PATH:/home/rmast/.local/bin
git clone https://github.com/DanBloomberg/leptonica.git

git clone https://github.com/agl/jbig2enc.git
wget https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py

cd leptonica/
git checkout v1.72
chmod +x configure
./configure
make
sudo make install

cd ../jbig2enc/
./autogen.sh
./configure
make
sudo make install

rmast avatar May 04 '22 11:05 rmast

Right, the current code is also inferior to what the normal tooling does since that uses the text layer info as well, but once I add that (I will try to do that soon), it could be better.

DjVu is a fun comparison but it has the advantage of being able to use image formats that are not supported in PDF.

MerlijnWajer avatar May 04 '22 12:05 MerlijnWajer

> DjVu is a fun comparison but it has the advantage of being able to use image formats that are not supported in PDF.

That's where DjVuToy comes in, which converts the DjVu result of DjVuSolo3.1 to a JBIG2/JPEG2000 PDF of 46kb. The DjVu itself is only 31kb.

rmast avatar May 04 '22 13:05 rmast

I can't find the source for that program. Is it free software? (If not: maybe another issue/place would be better to discuss that?)

MerlijnWajer avatar May 04 '22 13:05 MerlijnWajer

No, both are closed source. DjVuSolo3.1 is a very old pre-commercial demo of the capabilities of DjVu. When they commercialized DjVu they set such high prices that DjVu priced itself out of the market. I guess the Internet Archive once used DjVu. DjVuToy is actively maintained by a Chinese enthusiast, but he's not planning on opening the source.

rmast avatar May 04 '22 13:05 rmast

Here is the result via DjVuSolo3.1/DjVuToy3.06 unicode edition, half the size of your result for the Covid-health-form:

in.pdf

rmast@Ubuntu20:~$ pdfimages -list in.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     824  1162  rgb     3   8  jpx    yes        1  0   100   101 3080B 0.1%
   1     1 stencil  2472  3484  -       1   1  jbig2  no         3  0   300   300 17.6K 1.7%
   1     2 stencil  2472  3484  -       1   1  jbig2  no         4  0   300   300  347B 0.0%
   1     3 stencil  2472  3484  -       1   1  jbig2  no         5  0   300   300   68B 0.0%
   1     4 stencil  2472  3484  -       1   1  jbig2  no         6  0   300   300 2137B 0.2%
   1     5 stencil  2472  3484  -       1   1  jbig2  no         7  0   300   300  618B 0.1%
   1     6 stencil  2472  3484  -       1   1  jbig2  no         8  0   300   300  984B 0.1%
   1     7 stencil  2472  3484  -       1   1  jbig2  no         9  0   300   300  357B 0.0%
   1     8 stencil  2472  3484  -       1   1  jbig2  no        10  0   300   300 6063B 0.6%
   1     9 stencil  2472  3484  -       1   1  jbig2  no        11  0   300   300  324B 0.0%
   1    10 stencil  2472  3484  -       1   1  jbig2  no        12  0   300   300 11.2K 1.1%
   1    11 stencil  2472  3484  -       1   1  jbig2  no        13  0   300   300  125B 0.0%
   1    12 stencil  2472  3484  -       1   1  jbig2  no        14  0   300   300  114B 0.0%
   1    13 stencil  2472  3484  -       1   1  jbig2  no        15  0   300   300  322B 0.0%
   1    14 stencil  2472  3484  -       1   1  jbig2  no        16  0   300   300  129B 0.0%
   1    15 stencil  2472  3484  -       1   1  jbig2  no        17  0   300   300  246B 0.0%
   1    16 stencil  2472  3484  -       1   1  jbig2  no        18  0   300   300  210B 0.0%
   1    17 stencil  2472  3484  -       1   1  jbig2  no        19  0   300   300  335B 0.0%
   1    18 stencil  2472  3484  -       1   1  jbig2  no        20  0   300   300  194B 0.0%
   1    19 stencil  2472  3484  -       1   1  jbig2  no        21  0   300   300   74B 0.0%
   1    20 stencil  2472  3484  -       1   1  jbig2  no        22  0   300   300  170B 0.0%
   1    21 stencil  2472  3484  -       1   1  jbig2  no        23  0   300   300  349B 0.0%
   1    22 stencil  2472  3484  -       1   1  jbig2  no        24  0   300   300  325B 0.0%
   1    23 stencil  2472  3484  -       1   1  jbig2  no        25  0   300   300  109B 0.0%
   1    24 stencil  2472  3484  -       1   1  jbig2  no        26  0   300   300  139B 0.0%
   1    25 stencil  2472  3484  -       1   1  jbig2  no        27  0   300   300  271B 0.0%
   1    26 stencil  2472  3484  -       1   1  jbig2  no        28  0   300   300  913B 0.1%
   1    27 stencil  2472  3484  -       1   1  jbig2  no        29  0   300   300  138B 0.0%
   1    28 stencil  2472  3484  -       1   1  jbig2  no        30  0   300   300  113B 0.0%
   1    29 stencil  2472  3484  -       1   1  jbig2  no        31  0   300   300  116B 0.0%
   1    30 stencil  2472  3484  -       1   1  jbig2  no        32  0   300   300  117B 0.0%
   1    31 stencil  2472  3484  -       1   1  jbig2  no        33  0   300   300  401B 0.0%
   1    32 stencil  2472  3484  -       1   1  jbig2  no        34  0   300   300  202B 0.0%
rmast@Ubuntu20:~$ ls -al in.pdf
-rw-rw-r-- 1 rmast rmast 58988 May  5 18:00 in.pdf

The many jbig2-pictures stem from all the colors in the JB2-picture. DjVuToy translates those to separate images with their own color.
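
A rough sketch of that idea: an indexed-color foreground is split into one 1-bit stencil per color, and each stencil can then be painted in its own color. DjVuToy is closed source, so this is not its implementation; the function name and the convention that index 0 means "no ink" are assumptions.

```python
def split_into_stencils(indexed, palette):
    """Return {color: 1-bit mask} for every palette color that occurs.

    `indexed` is a list of palette indices; index 0 is treated as
    transparent background and gets no stencil.
    """
    stencils = {}
    for idx in sorted(set(indexed)):
        if idx == 0:
            continue
        stencils[palette[idx]] = [1 if p == idx else 0 for p in indexed]
    return stencils


palette = {1: (0, 0, 0), 2: (200, 0, 0)}      # black text, red stamp
foreground = [0, 1, 1, 0, 2, 2, 0, 1]
stencils = split_into_stencils(foreground, palette)
for color, mask in stencils.items():
    print(color, mask)
```

This would explain why the listing shows dozens of tiny stencil entries: every distinct foreground color becomes its own bitonal image.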

rmast avatar May 05 '22 18:05 rmast

Especially take a look at the clarity of the background picture...

rmast avatar May 05 '22 18:05 rmast

So I've cleaned up the code a bit and am looking for some people to try and run it on their OCRMyPDF results. (Let's not focus on DjVu stuff here please, as I'm trying to make a tool that people can use based on existing/working code)

You'll need this build of archive-pdf-tools: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2477636215 (just click on the artifact download link and pick the release for your os/python interpreter from the artifact.zip)

And then download this script: https://archive.org/~merlijn/pdfcomp.py

Use like so:

$ python pdfcomp.py /tmp/ocrmypdf.pdf /tmp/ocrmypdf_comp.pdf
Compression factor: 5.193651663405088

Some random notes...

  • The tool will also extract the text from the PDF and use that to aid in the compression. It has seen light testing and I don't recommend running it on PDFs that weren't made by OCRmyPDF (and perhaps just Tesseract) at this point
  • The text layer should be fully intact -- in fact, everything else should be too.
  • The tool will also leave all your PDF metadata alone, so the creator metadata will still be OCRmyPDF / Tesseract:
$ grep -a Tess /tmp/ocrmypdf_comp.pdf
  /Creator (ocrmypdf 6.1.2 / Tesseract OCR-PDF 4.1.3)
<xmp:CreatorTool>ocrmypdf 6.1.2 / Tesseract OCR-PDF 4.1.3</xmp:CreatorTool></rdf:Description>

MerlijnWajer avatar Jun 10 '22 21:06 MerlijnWajer