
Usefulness of MRC for decent quality compression of scanned book pages with illustrations

Open fusefib opened this issue 3 years ago • 42 comments

Opening a new issue as requested.

Here are some samples: https://mega.nz/folder/BRhChKob#xo-HHaJrD9VYN6YV3ur9WA

  • 128.tif & 188.tif - the original cleaned-up 600 dpi scans
  • *-scantailor.tif - 600 dpi mixed output with bitonal text and color photos, as autodetected
  • *-scantailor-pdfbeads.pdf - the above .tif split into two layers, with the text layer JBIG2-encoded, the background layer JP2-encoded and downsampled to 150 dpi, and everything assembled into a PDF using pdfbeads
  • *.jp2 - some compressed versions of the original (I forgot the settings). Page 128 is almost half the size of the PDF, so I assume the PDF sizes can be slightly improved.

The folders have some residual files. ScanTailor itself can now split tiffs, though I have no idea how to merge them as layers in a PDF. (That would be useful to learn.)

Can MRC output be made comparable to these PDFs at the same or lower size? I'm also curious whether that can be achieved directly from the original cleaned-up scan, or whether the ScanTailor mixed-output step is still advised.

fusefib avatar Dec 06 '21 15:12 fusefib

I took a look and I have a few thoughts. The damage to the photos comes mostly from parts of the photo being marked as background and others as foreground. Ultimately, MRC is not ideal for photos, but I think we can come up with something that is quite workable if we can figure out what parts are just images.

  1. Having photo information in the hOCR file can help us identify photo regions (see https://github.com/internetarchive/archive-pdf-tools/issues/23). This needs to be added to Tesseract's hOCR renderer and a colleague of mine is working on it.
  2. Scantailor seems to generate useful output where it separates the background and the foreground (alternative to (1)). I need to think about how we can use that output exactly (maybe you could provide a mask where we are sure there is only background). One option would be to have a way to already provide the background and foreground images separately, but I'd have to rewrite parts of the code to make that work without too many hacks.
  3. The photos in your image have a lot of digitisation/camera noise, blurring some of those parts might also help with mask generation or compression. However, currently archive-pdf-tools will only blur an entire image/page if it deems it too noisy.

If we have a good idea of what is text and what is photo, we can attempt to use JPEG2000 Region of Interest encoding, and we will also have the mask exclude any/all parts of the photo. Then we can encode the photo as part of the background, and try to get higher quality at the regions where we think we have photos. openjpeg/grok has some form of ROI and kakadu also has -roi in combination with Rshift/Rweight/Rlevels. I haven't gotten this to work in the past, but maybe we ought to re-try when we have ocr_photo support in hOCR files.

So to summarise, I think we can make the software handle this better if it knows what regions are images. Ideally we get that from the hOCR file, but we can also think of another way to provide that information through ScanTailor (or custom code).

BTW: you can already get better compression than pdfbeads simply by providing higher-quality --bg-compression-flags and --fg-compression-flags.
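For example, something along these lines (a sketch with placeholder filenames and illustrative rates; the compression flags shown are kakadu-style and assume -J kakadu, as in the commands later in this thread, where a higher -rate means higher quality and a larger file):

recode_pdf -v --dpi 600 -J kakadu -I 128.tif --hocr-file 128.hocr --bg-downsample 3 --bg-compression-flags '-rate 0.05' --fg-compression-flags '-rate 0.15' -o 128-mrc.pdf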

Useful links:

  • https://github.com/uclouvain/openjpeg/issues/924
  • https://webdocs.cs.ualberta.ca/~anup/Courses/604/NOTES/PrioritizedJPEG2000.pdf
  • https://www.researchgate.net/publication/4072258_Region-based_guaranteed_image_quality_in_JPEG2000

MerlijnWajer avatar Dec 07 '21 23:12 MerlijnWajer

This link: https://www.researchgate.net/publication/281283716_The_Significance_of_Image_Compression_in_Plant_Phenotyping_Applications

A suggestion there is to perform ROI encoding like this:

Lossy (ROI): kdu_compress -no_weights Rshift=16 Rlevels=5 -roi roifile,0.5 -rate r

I could give that a try later this week.

MerlijnWajer avatar Dec 07 '21 23:12 MerlijnWajer

The folders have some residual files. ScanTailor itself can now split tiffs, though I have no idea how to merge them as layers in a PDF. (That would be useful to learn.)

I've not seen that working either. @trufanov-nok has done some similar work on getting those split files into a .djvu, but I've not tried it yet.

rmast avatar Dec 07 '21 23:12 rmast

"Merging" them as layers is not possible in PDFs, but you can have images on top of each other with transparency. Or you can merge them before you insert them. But that wouldn't be necessary if we try use some of my above comments. I have never used scantailor but it looks cool, maybe we can support using scantailor to clean up documents some.

MerlijnWajer avatar Dec 07 '21 23:12 MerlijnWajer

I saw that JPEG2000 also has a composite JPM format, meant for MRC. I don't know whether it has more possibilities than PDF already has, but since JPEG2000 is part of PDF, one would expect those JPM possibilities to be usable in a PDF.

rmast avatar Dec 08 '21 00:12 rmast

I don't think it really matters: you'd still be encoding the JBIG2 and JPEG2000 images separately in the JPM (which is what we do in the PDFs too, at little overhead), but as far as I can tell JPM support is practically non-existent in tooling, making it not a great format to target.

MerlijnWajer avatar Dec 08 '21 00:12 MerlijnWajer

@rmast The option is available with Scantailor Advanced, which is still overall better than ScanTailor Universal. https://github.com/4lex4/scantailor-advanced/releases

I used ScanTailor Advanced's Picture Shape -> Rectangular with Sensitivity (%): 100% and checked Higher Search Sensitivity. I haven't looked at the code yet. It worked well, perhaps because the scan was already cleaned up. But a similar way to detect image regions, and then lowering the compression for those regions, could be a safe, across-the-board solution for the kind of large-scale automated tasks @MerlijnWajer has in mind.

It also has a Splitting box when outputting in Mixed mode, and that's where you get the files.

PDFbeads performs the splitting separately, based on the ScanTailor mixed output (though IIRC it can do something on its own with some options).

@MerlijnWajer Do you happen to be aware of any existing tooling that can do one of the things that pdfbeads does, namely make the JBIG2 a transparent layer in the PDF (I guess that's what happens, then the downscaled image gets underlaid), but from a specific JBIG2 tiff input?

fusefib avatar Dec 08 '21 00:12 fusefib

@MerlijnWajer Do you happen to be aware of any existing tooling that can do one of the things that pdfbeads does, namely make the JBIG2 a transparent layer in the PDF (I guess that's what happens, then the downscaled image gets underlaid), but from a specific JBIG2 tiff input?

Not sure if I follow. The way archive-pdf-tools works, overly simplified:

  1. Load image, create MRC components (mask, foreground, background)
  2. Compress the foreground and background as JPEG2000 (after "optimising" them); compress the mask as JBIG2 or CCITT.
  3. Paste the background into the PDF page.
  4. Paste the foreground into the PDF page, over the background image, with the mask as the alpha (transparency) layer.

This is actually visible if your computer is sufficiently slow: first the background image will finish decompressing, at which point you will see it, and only later the "text" (foreground) layer appears.

So it sounds like it does what you're suggesting, right?

MerlijnWajer avatar Dec 08 '21 00:12 MerlijnWajer

Or rather, adding an image with JBIG2 as transparency layer is what it does already -- so we have code that can do it, iiuc.

MerlijnWajer avatar Dec 08 '21 00:12 MerlijnWajer

This discussion popped up in my notifications, and I'm not sure if this is relevant, but I would like to note that ScanTailor is a kind of semi-automatic text-to-image segmenter, and by default it outputs a single image. It seems that 12 years ago the author decided to reserve pure white (0x??FFFFFF) and pure black (0x??000000) for the text parts, and those colors do not occur in the illustration parts of the result. In other words, if ScanTailor treats something as an illustration, the variability of its pixel colors is limited to every color except those two, and you can't expect to find pure black or pure white pixels there.

It seems the "export" functionality was introduced 7 years ago in ScanTailor Featured and was adopted as legacy by the currently active forks, Advanced and Featured. Basically it just reads the output image pixel by pixel and writes it to two different image files: one b/w only, and one with everything except that. We call them "layers", but that is just a reference to the so-called "method of separate layers" - an approach to assembling DjVu documents that works around the fact that the open-source DjVu encoders have no text-to-image segmenters at all, and the commercial one can make mistakes. One of the output files gets a ".sep" suffix, and such a pair of images is designed to be used with the "DjVu Imager" application. The idea is to encode the bundled b/w DjVu document (with a commercial or open-source encoder) and later automatically insert the illustrations into it with DjVu Imager, which matches the ".sep" files to the corresponding pages by filename. So you don't need to rely on automatic segmentation at all, which gives you the best-looking illustrations.

The ScanTailor versions (I guess... all of them) that reserve the b/w colors for the text part in the output images can be identified by the presence of the reserveBlackAndWhite function in the source code. In other words, you don't need the export functionality if you can read the ScanTailor-processed image pixel by pixel; the "export" could be done on the fly.

trufanov-nok avatar Dec 08 '21 02:12 trufanov-nok

So one of the issues, background pictures containing fuzz behind the foreground, cannot be solved with this reserveBlackAndWhite output alone. I don't think the pixels surrounding that reserved black and white are cleaned up by ScanTailor the way it is documented for DjVu, where gradient vectors in the original picture are met by extending only the background vector.

rmast avatar Dec 08 '21 07:12 rmast

If we have a good idea of what is text and what is photo, we can attempt to use JPEG2000 Region of Interest encoding, and we will also have the mask exclude any/all parts of the photo. Then we can encode the photo as part of the background, and try to get higher quality at the regions where we think we have photos. openjpeg/grok has some form of ROI and kakadu also has -roi in combination with Rshift/Rweight/Rlevels. I haven't gotten this to work in the past, but maybe we ought to re-try when we have ocr_photo support in hOCR files.

This is relevant to my interests. I'm also curious to see whether ROI-compressing a scan to JP2, using the mask recode_pdf generates, yields a good image. Just using that single image in the PDF as a mixed foreground/background without a mask may be an interesting middle ground where we don't have to use a (relatively slow) JBIG2 mask, but still get a better compression ratio than classic JPEG-compressed PDFs.

If it helps, using a mask image with kakadu is discussed briefly in Advanced JPEG 2000 image processing techniques:

kdu_compress -i image.ppm -no_weights -o image.jp2 -precise -rate - \
  Cblk={32,32} Ckernels=W9X7 Clayers=12 Clevels=5 Creversible=no Cycc=yes \
  Rweight=16 Rlevels=5 -roi mask.pgm,0.5

This example compresses a color image losslessly using the ROI "Significance Weighting" method, using an image mask to specify the ROI. (...) [The] distortion cost function which drives the layer formation algorithm is modulated by region characteristics.

Reid, J. (2003). Advanced JPEG 2000 image processing techniques. Proceedings of SPIE, 5203(1), 223-224.
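To try the single-image middle ground mentioned above, the ROI mask would only be used at encoding time, and the PDF would get one JP2 with no JBIG2 mask at all. A rough sketch (assuming kakadu's kdu_compress and img2pdf, which can embed an existing JPEG2000 stream into a PDF without re-encoding; the mask file and rate are placeholders):

# ROI-encode the whole page; the mask only steers quality, it is not stored in the PDF
kdu_compress -i page.ppm -o page.jp2 -no_weights -rate 0.3 Creversible=no Rweight=16 Rlevels=2 -roi mask.pgm,0.5
# wrap the JP2 in a PDF without re-encoding
img2pdf page.jp2 -o page.pdf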

Redsandro avatar Feb 20 '22 17:02 Redsandro

@Redsandro - right, please feel free to try and toy around with kdu_compress ROI encoding. As a debug feature, I have not yet added an option to dump all images losslessly (say PNG or TIFF) before encoding, but I could add that if you plan to toy with it. I had very limited luck trying to use the kakadu ROI encoding in the past, but I might have done it wrong.

MerlijnWajer avatar Feb 20 '22 19:02 MerlijnWajer

I don’t think a 1:1 mask generated from a binarized picture would reveal regions of interest. Most of the page is just fuzzy black or fuzzy white. A page with a fuzzy signature could probably benefit from this ROI for the signature. Question is whether recognition of those ROI spots can be automated or needs a manual activity.

rmast avatar Feb 20 '22 20:02 rmast

If you can get ROI encoding working in Kakadu, I can add support for the hOCR ocr_photo element, which Tesseract can now do: https://github.com/tesseract-ocr/tesseract/pull/3710 - that's probably a good start, it won't help with comics in particular I suspect.

MerlijnWajer avatar Feb 20 '22 20:02 MerlijnWajer

@MerlijnWajer commented:

please feel free to try and toy around with kdu_compress ROI encoding.

After toying around, I observe that Rweight and Rlevels control how far the encoder deviates from the rate-based compression. The number after the mask filename specifies the baseline value in the mask, between 0 (black) and 1 (white).

kdu_compress -i in.tif -o out.jp2 -no_weights -rate 0.5 Creversible=no Rweight=16 Rlevels=2 -roi mask.pgm,0.5

mask.pgm (attached image)

in.tif (attached image)

By default the ROI mask consists of 128x128 pixel patches.

out-default.jp2 (attached image)

To make a more accurate mask, you need to set Cblk, e.g.: Cblk={16,16}. (You probably need to escape that with your shell, e.g. Cblk=\{16,16\})

out-16.jp2 (attached image)

The thing to keep in mind, though, is that setting a lower Cblk makes every code block smaller, causing less efficient encoding and a bigger file, although not by a lot if you keep it sane and don't go below 8x8. The default was 64x64 in an older version and is now 128x128; perhaps, since almost no one uses ROI anyway, saving those extra kBs was considered worth it. So mask accuracy comes at a cost. It may be worth paying, but that would require some experimenting on the relevant data. Perhaps, using this experiment as a starting point, you can get better results.

If you want to know more about flags, this is helpful, although some of the defaults are different on my build/machine.
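For example, one way to gauge that size cost on your own data is to encode the same image with and without a smaller Cblk and compare the output sizes (a sketch with placeholder filenames; -rate - keeps all compressed data, so the difference you see is mostly code-block overhead, and the default block size may be 64x64 or 128x128 depending on the build, as noted above):

# encode with the default code-block size, keeping all compressed data (-rate -)
kdu_compress -i in.tif -o out-cblk-default.jp2 -rate - Creversible=no
# same image with 16x16 code blocks; the size difference is mostly block overhead
kdu_compress -i in.tif -o out-cblk16.jp2 -rate - Creversible=no Cblk='{16,16}'
ls -l out-cblk-default.jp2 out-cblk16.jp2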

Redsandro avatar Feb 25 '22 01:02 Redsandro

Hey, looks like you actually got it to work. That's great. I'll try to look at how we can use/integrate this to compress better (accuracy / size).

MerlijnWajer avatar Feb 25 '22 08:02 MerlijnWajer

When I think of a way to compress the PostNL bill that I used before as a test subject, I could imagine using the high-density part, the square ocr_photo frame around the logo, as the ROI. The grey dithered drawings at the bottom are not detected as ocr_photo by Tesseract, so they would just end up greyish outside the ROI. That way, using the ROI in the background picture, I would expect the logo to keep some quality in the background. I'm curious whether this would give better quality/compression.

rmast avatar Mar 06 '22 22:03 rmast

@Redsandro -JFYI I still plan to work on this, I just had a long work trip and am only just coming back to this, and these kinds of improvements are more or less spare-time projects. Maybe in a few days I can make a branch with this integrated.

One thing we'll need is some testing framework to do comparisons (PSNR, SSIM, etc). I think I made a start with that, so we could compare to see how well ROI helps with compression ratios and quality.
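In the meantime, a quick manual comparison is possible by rendering both PDFs back to images and comparing them against the source with ImageMagick (a sketch with placeholder filenames; it assumes poppler's pdftoppm and ImageMagick's compare, that the render resolution matches the original's DPI, and that single-page output gets a -1 suffix):

# render each PDF back to PNG at the original resolution
pdftoppm -r 300 -png out-roi.pdf roi
pdftoppm -r 300 -png out-normal.pdf normal
# compare each rendered page against the source image (PSNR is printed to stderr)
compare -metric PSNR original.png roi-1.png null:
compare -metric PSNR original.png normal-1.png null: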

MerlijnWajer avatar May 04 '22 18:05 MerlijnWajer

JFYI I still plan to work on this, I just had a long work trip and am only just coming back to this, and these kinds of improvements are more or less spare-time projects.

No problem, I understand. You may want to manually try what you had in mind initially, to see if it is roughly as useful as we hope. Because if the quality-versus-compression trade-off is really not interesting once the code block size limitations are taken into account, it would be a waste to set up a lot of scaffolding.

Redsandro avatar May 04 '22 19:05 Redsandro

The grey ABN AMRO text at the top of the ABN AMRO letter is recognized by Tesseract as text of size 75 in a bounding box. The shield logo to the left appears to be recognized as an apostrophe in a bounding box. Would you use the ROI detail for compressing these bounding boxes in JPEG2000, or would you still use JBIG2 to put sharp mask boundaries over a rough color picture? Those letters are just grey on white; the shield logo is green/yellow on white.

rmast avatar May 04 '22 20:05 rmast

I figured we'd still use the JBIG2, and just get more quality for the parts that we care about. We could see how it works without the mask, but I'm a bit sceptical.

MerlijnWajer avatar May 04 '22 21:05 MerlijnWajer

So all text will be masked by JBIG2 and colored by a low-res foreground picture, and photo elements will get ROI attention in the background picture. Usually that means text at 300 dpi and the background picture at 100 dpi?

rmast avatar May 05 '22 06:05 rmast

I have pushed some code here: https://github.com/internetarchive/archive-pdf-tools/tree/roi

The wheels will end up here: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2276084863

I can run it like so (for testing purposes):

recode_pdf -v --bg-compression-flags '-no_weights -rate 0.005 Cblk={16,16} Creversible=no Rweight=16 Rlevels=2' --fg-compression-flags '-no_weights -rate 0.075 Cblk={16,16} Creversible=no Rweight=16 Rlevels=2' --dpi 300 -J kakadu -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out-roi.pdf

ROI mode is currently enabled when "Creversible=no" is found in the flags (literally) - and that is a hack, I know.

The background seems to improve with the mask, the foreground not so much? (For the background, we use the inverted mask) - I hope I didn't swap the inverted-ness for background/foreground.

With the above parameters the size is about the same as without roi and default slope values.

Maybe give it a try?
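(As an aside, for anyone reproducing the background case by hand with kdu_compress outside recode_pdf: the inverted mask can be produced beforehand, e.g. with ImageMagick, and then passed to -roi. A sketch with placeholder filenames, reusing the rate and flags from the command above:)

# invert the mask so the non-text regions become the ROI for the background
convert mask.pgm -negate mask-inverted.pgm
kdu_compress -i bg.tif -o bg.jp2 -no_weights -rate 0.005 Cblk='{16,16}' Creversible=no Rweight=16 Rlevels=2 -roi mask-inverted.pgm,0.5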

MerlijnWajer avatar May 05 '22 13:05 MerlijnWajer

I have also pushed a commit where I swapped the inverted-ness, which will build separately as an action: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2276139177

I also changed the rate from 0.005 to 0.01 for the bg compression flags locally, so you might want to update my command line accordingly; I think it makes for a fairer comparison. Here is my command with downsampling added in:

recode_pdf -v --bg-compression-flags '-no_weights -rate 0.01 Cblk={8,8} Creversible=no Rweight=16 Rlevels=2' --fg-compression-flags '-no_weights -rate 0.075 Cblk={8,8} Creversible=no Rweight=16 Rlevels=2' --dpi 300 -J kakadu -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out-roi.pdf --bg-downsample 3

vs the 'normal':

recode_pdf --bg-downsample 3 -v --dpi 300 -J kakadu -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out.pdf

The background definitely looks less noisy.

MerlijnWajer avatar May 05 '22 14:05 MerlijnWajer

It might also make sense to use different Cblk values for the background and the foreground, I can imagine. For the background we maybe don't need the regions to be that small, but for the foreground we likely do.

MerlijnWajer avatar May 05 '22 14:05 MerlijnWajer

roi-diff (attached comparison image)

Definitely seems to make a difference for background noise...

MerlijnWajer avatar May 05 '22 17:05 MerlijnWajer

I think in general this looks like it can offer an improvement, but I'll need to think about how it can be integrated properly. Maybe it's time to offer some encoding "profiles", so that people can pick one without having to fiddle with the exact OpenJPEG flags, kakadu rates, and so on.

MerlijnWajer avatar May 05 '22 17:05 MerlijnWajer

I experimented with didjvu via c44 a while ago: https://github.com/jwilk/didjvu/issues/19. With subsample ratios of 3 to 5 for the background picture, c44 was able to almost clear the background (I guess by using the patented vector estimation to filter out the surrounding fuzz caused by partial pixels with a color between the foreground and background). The patent has expired and c44 is open source, so there might be another option for clearing the background fuzz.

rmast avatar May 05 '22 17:05 rmast

@rmast - could you share some command lines to go from a tif/png/pgm/etc background image (before I "optimise" them, but after "removing" the foreground) to a djvu component which is then converted back to png? That would ease testing.

MerlijnWajer avatar May 05 '22 21:05 MerlijnWajer
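(For reference, such a round trip might look roughly like the following; this is only a sketch with placeholder filenames and settings, assuming djvulibre's c44 and ddjvu plus ImageMagick for format conversion, and not the exact settings from the linked didjvu experiment:)

# c44 reads PPM/PGM/JPEG, so convert the background image first
convert bg.tif bg.ppm
# encode as an IW44 ("photo") DjVu file; -slice and -dpi values are placeholders
c44 -dpi 200 -slice 74+13+10 bg.ppm bg.djvu
# decode back to PPM and convert to PNG for inspection
ddjvu -format=ppm bg.djvu bg-roundtrip.ppm
convert bg-roundtrip.ppm bg-roundtrip.png

(c44 also has a -mask option for marking don't-care pixels, which may be related to the fuzz-clearing behaviour mentioned above.)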