Request: Improve black and white output quality

Open rudolphos opened this issue 3 years ago • 2 comments

I'm only using ScanTailor for deskewing and margins, as those are the best features this software has to offer. Most of the time, though, the black-and-white text output is poor and too noisy: it removes too much of the original and leaves parts of characters missing.

Original image

This is how the output looks in ABBYY FineReader using just the 'Whiten background' feature with no other options. I suspect it just converts to grayscale and posterizes, since a one-page TIFF export is around the same size (80 KB) as in ScanTailor (ST). (image)

Here's the ScanTailor output with default settings: image

With Savitzky-Golay and morphological smoothing disabled: image

ABBYY had the result most similar to the original text.

rudolphos avatar Mar 13 '21 01:03 rudolphos

You're right that the ScanTailor output is not quite as precise and smooth as the ABBYY output, but overall it seems fairly similar to me.

mara004 avatar Mar 17 '21 10:03 mara004

I've always wanted to delve deeper into how DjVu does this (see the sixth example here). Meanwhile, when I want to remove the background from a dirty scanned page, I do the following in GIMP:

  1. Open an image and duplicate its single layer
  2. Estimate the background color by applying a median blur filter to the new layer with a percentile above 50% (usually 70%), adjusting the radius to the content (usually 80 px at 300 dpi, or more if there are large illustrations) until the preview has no blobs related to actual content
  3. Remove the estimated background by inverting the colors of the new layer, setting its layer mode to Addition (Legacy), and merging it with the original layer

As this requires previewing to determine the best blur radius, it cannot be applied automatically without risking destroying some content. But a manual adjustment may be compatible with ScanTailor's UI. This works great with these "default" settings for text-only documents with no graphics other than thin lines.
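The three GIMP steps above can be sketched roughly in plain Python. This is only an illustration under stated assumptions, not what GIMP or ScanTailor actually run: the image is a grayscale list of rows with values 0..255, the window is square rather than GIMP's blur shape, and the function names `percentile_filter` and `remove_background` are made up for this example. Inverting the background layer and merging it in Addition mode amounts to `min(255, original + (255 - background))` per pixel.

```python
def percentile_filter(img, radius, percentile):
    """Square-window percentile filter on a 2D list of 0..255 ints.

    The window is cropped at the image borders (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            )
            # Index of the requested percentile inside the sorted window.
            k = min(len(window) - 1, int(len(window) * percentile / 100))
            out[y][x] = window[k]
    return out


def remove_background(img, radius=2, percentile=70):
    """Estimate the background, invert it, and add it to the original."""
    background = percentile_filter(img, radius, percentile)
    return [
        [min(255, p + (255 - b)) for p, b in zip(row, bg_row)]
        for row, bg_row in zip(img, background)
    ]
```

Pixels that match the estimated background come out pure white, while darker pixels keep their distance from it. The pure-Python loops are far too slow for a real 300 dpi page with an 80 px radius; an optimized routine such as `scipy.ndimage.percentile_filter` would be the practical choice there.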

Sometimes even a very large radius can still leave some content-related blobs. In these cases, between steps 2 and 3, I select the remaining blobs and repeat the median blurring to reduce them further, which fills them with the surrounding background color, and then I apply a bit of Gaussian blur to soften the edges of the selection. But all this manual work may not suit ScanTailor's UI.

After removing the background, I apply GIMP White Balance to make the text very readable.
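As far as GIMP's documentation describes it, White Balance stretches each color channel separately after discarding the values at each end of the histogram that are used by only 0.05% of the pixels. A minimal single-channel sketch of that idea follows; the function name `stretch_levels` and the `clip_fraction` parameter are assumptions for illustration, and GIMP's exact clipping rule may differ in detail.

```python
def stretch_levels(img, clip_fraction=0.0005):
    """Linearly stretch a grayscale image (2D list of 0..255 ints) to 0..255.

    A fraction `clip_fraction` of the pixels is ignored at each end of the
    sorted values before the black and white points are chosen.
    """
    values = sorted(p for row in img for p in row)
    n = len(values)
    cut = int(n * clip_fraction)
    lo, hi = values[cut], values[n - 1 - cut]
    if hi == lo:
        # Flat image: nothing to stretch.
        return [row[:] for row in img]
    scale = 255 / (hi - lo)
    return [
        [max(0, min(255, round((p - lo) * scale))) for p in row]
        for row in img
    ]
```

After background removal the paper is already white, so the main effect of this step is to pull the darkest text pixels down toward black.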

ftrebien avatar Apr 22 '24 20:04 ftrebien