
Reducing PDF size by splitting the image and compressing each area differently

rdiez opened this issue 2 years ago · 3 comments

I have a big Canon printer at work. When I scan a document as PDF without the 'compact' option, the PDF has just one big JPEG inside.

However, when I scan a document as PDF with options 'compact' and 'OCR' turned on (which seems to be implemented by some Adobe software in the printer), it does a pretty good job:

  • Areas with black-and-white text are scanned as separate pictures, at 300 DPI, and are compressed with a monochrome algorithm (CCITT or JBIG2).
  • Everything else (like company logos) is scanned as a single 150 DPI colour or grayscale JPEG, with the black-and-white text areas removed beforehand.

Those pictures are probably transparent and stacked in the PDF, so that you do not see any visual differences compared with PDFs scanned without the 'compact' option (with a single picture per page).
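
Roughly, I imagine the splitting itself would look something like this (just a toy sketch with Pillow; the file name, the text region and the thresholds are invented, and real text regions would have to be detected somehow):

```python
# Toy sketch only: the file name, text region and thresholds are invented.
from PIL import Image

page = Image.open("page.png").convert("L")        # the 300 DPI grayscale scan
text_box = (0, 0, page.width, page.height // 2)   # pretend the top half is text

# Text area: binarise it and store it with a bilevel codec (CCITT Group 4 here).
text_layer = page.crop(text_box).point(lambda p: 255 if p > 160 else 0, mode="1")
text_layer.save("text_layer.tiff", compression="group4")

# Background: blank out the text area, halve the resolution (300 -> 150 DPI)
# and store it as an aggressive JPEG.
background = page.copy()
background.paste(255, text_box)                   # remove the text before compressing
background = background.resize((page.width // 2, page.height // 2))
background.save("background.jpg", quality=40)
```

Putting the two pieces back into a single PDF page, with the bilevel layer stacked on top of the JPEG, is the part I do not know how to do with open-source tools.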

OCR seems to be performed on the 300 DPI black-and-white areas. If you turn the 'compact' option off, and scan as a single 150 DPI image, the OCR results are worse.

The resulting PDF is very small, around half of what you would achieve by aggressively compressing a single 300 DPI JPEG picture per page, and it still looks good. I am attaching a small file I scanned this way to this issue for test purposes.

I have seen another commercial OCR product, eCopy ShareScan with Omnipage, that does a similar thing.

I do not know what the order of operations is:

  • Possibility 1) The image is broken down first, using some simple heuristic, and then OCR is done over the black-and-white areas.
  • Possibility 2) OCR is performed over the whole image, and the OCR engine reports the text boundaries.
    Those boundaries are then used to split the image and apply a different image compression algorithm to each part.
    For example, Tesseract can return the bounding box of each piece of text. Apparently, it is easy to retrieve this information when invoking Tesseract with pytesseract (see the sketch after this list).
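
For possibility 2, this is roughly what I mean (a minimal sketch with pytesseract; "page.png" is just a placeholder for one scanned page):

```python
# Minimal sketch; "page.png" is a placeholder for one scanned page.
import pytesseract
from PIL import Image

page = Image.open("page.png")
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

# One bounding box per recognised word; these could be merged into larger
# text regions and cut out of the image for bilevel compression.
word_boxes = [
    (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
    for i in range(len(data["text"]))
    if data["text"][i].strip()
]
print(word_boxes[:5])
```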

Does anybody know what this technique of breaking up a scanned picture for OCR and space-saving purposes is called? I did not find much information about this subject on the Internet.

I do not know of any open-source tool capable of optimising scanned images for OCR purposes in this way. Does anybody know one?

I wonder if OCRmyPDF could some day implement this ability to identify text areas, break up the image accordingly, and compress each resulting image with a different algorithm.

rdiez avatar Feb 06 '22 21:02 rdiez

Discussed here - #541. Multi-resolution compression is the main name for this. Image segmentation is the main technique for labelling pixels as belonging to different categories.

DPI adjustments aren't necessarily as efficient as making the compression ratio more aggressive. That is why the existing optimizer favors lowering JPEG quality and using pngquant over downsampling.
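
As a toy illustration of that trade-off (this is not what the optimizer actually does internally, and the file names and quality values are arbitrary):

```python
# Toy comparison only; the result depends heavily on the image.
import os
from PIL import Image

page = Image.open("page.jpg")   # some 300 DPI scanned page

# Keep the resolution, lower the JPEG quality.
page.save("quality_only.jpg", quality=30)

# Halve the resolution (300 -> 150 DPI) at a moderate quality.
page.resize((page.width // 2, page.height // 2)).save("downsampled.jpg", quality=75)

for name in ("quality_only.jpg", "downsampled.jpg"):
    print(name, os.path.getsize(name), "bytes")
```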

Mainly I can't devote the time to working on a new feature of this complexity right now, and probably not for several months. Also, there may be an open source answer by way of archive.org, but we're not license compatible.

jbarlow83 avatar Feb 06 '22 22:02 jbarlow83

About DPI adjustments not necessarily being the most efficient option: it is probably true that it is better to increase the JPEG compression ratio than to downsample the image first. The difficulty is how to make that work in practice with OCRmyPDF.

I have a large collection of mixed PDFs that come from different scanners and were scanned with different settings. Some have 300 DPI images, some 200 DPI, some 150 DPI. Some use the Mixed Raster Content technique, others do not. Some have an OCR text overlay, others do not. The collection is growing by the day, and the disk is now full. It is probably cheaper to just buy a new disk, and that is what I will probably do next, but I nevertheless found this PDF optimisation problem interesting.

The optimisation rule I came up with is: OCR probably needs more resolution, but afterwards, 150 DPI is enough for long-term archiving purposes, even if highly compressed with JPEG.

As far as I can tell, I can pass the command-line option "--optimize 3" to OCRmyPDF, but I have no finer control than that. Is that right?

I compared the following:

a) OCRmyPDF 300 DPI with --optimize 3

b) downsampling from 300 DPI to 150 DPI first, and then running OCRmyPDF with --optimize 3

And (b) still produced acceptable quality with a much smaller PDF file.
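
Something along these lines reproduces the comparison (a sketch assuming Ghostscript and the OCRmyPDF Python API are available; the file names are placeholders):

```python
# Sketch of the comparison; file names are placeholders.
import subprocess
import ocrmypdf

SRC = "scan_300dpi.pdf"

# (a) OCR the original 300 DPI scan and let the built-in optimizer work.
ocrmypdf.ocr(SRC, "out_a.pdf", optimize=3)

# (b) Downsample the images to 150 DPI with Ghostscript first, then OCR.
subprocess.run([
    "gs", "-o", "downsampled.pdf", "-sDEVICE=pdfwrite",
    "-dDownsampleColorImages=true", "-dColorImageResolution=150",
    "-dDownsampleGrayImages=true", "-dGrayImageResolution=150",
    SRC,
], check=True)
ocrmypdf.ocr("downsampled.pdf", "out_b.pdf", optimize=3)
```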

The trick would be to tell OCRmyPDF something along these lines:

  1. If the image is 150 DPI, --optimize 3 is fine.

  2. If the image is 300 DPI, optimize even harder so that the resulting size is comparable to a 150 DPI image processed with --optimize 3.

How do I achieve that?

rdiez avatar Feb 07 '22 10:02 rdiez