OCRmyPDF Under `--redo-ocr`, a fragment of original OCR text was made visible in output

Open peterlefanulumsdaine opened this issue 4 years ago • 2 comments

Describe the bug

Using OCRmyPDF to redo OCR on a scanned document, some parts of the original erroneous OCR text became visible in the output pdf. (Most of the original OCR text was stripped, as expected.)

To Reproduce

The command was run with:

ocrmypdf --redo-ocr -l eng+kat Basic\ Georgian\ lessons\ 6–7.pdf l67-redoocr.pdf

I can attach full verbose log output if desired; I was unable to pick out anything useful in the log.

Example file

Full input pdf can be found here. It’s GPG encrypted with the maintainer’s key, and shouldn’t be publicly redistributed.

The unexpected output occurs in the last paragraphs of the 1st and 3rd pages.

Expected behavior

The original input pdf was a scanned document produced by Adobe Scan for iPhone. It had a mostly-erroneous OCR layer, since most of the document is in Georgian (i.e. in the Georgian script, ქართული ენა), but the Adobe OCR had (it seems) attempted to read it all as English text.

Running ocrmypdf --redo-ocr -l eng+kat, my expectation was to remove the original OCR layer, and re-OCR in English and Georgian, without altering anything visible in the document.

This was successful on most of the document, but in just a few paragraphs, the original (erroneous) OCR text was not stripped, but became visible in the output pdf.

Screenshots

A section of the input and output where this occurs: Input image Output image

(from bottom of 3rd page)

System

OS: Mac OS Catalina, 10.15.7
OCRmyPDF Version: 11.6.3
OCRmyPDF installed with the Homebrew package manager, brew install ocrmypdf.

Feb 17 '21 10:02 peterlefanulumsdaine

+1 for providing a test file, interesting problem, and introducing me to an exotic script that looks like a cross between Greek and Arabic.

The problem here, which is a bug or perhaps missing feature, is that ocrmypdf can't detect this particular type of OCR when performing --redo-ocr, so it doesn't remove the pre-existing OCR correctly. Unfortunately I don't know when I'll be able to improve this, because there are several types of OCR we'll need to identify and remove.

You should get much better results with --force-ocr.

Feb 23 '21 08:02 jbarlow83
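
For readers who want to try the suggested workaround from a script, here is a minimal sketch of the `--force-ocr` invocation with the same languages as the original report. The helper names `build_force_ocr_command` and `run_force_ocr` and the output filename `l67-forceocr.pdf` are hypothetical, made up for this illustration; only the `ocrmypdf` command and its `--force-ocr` and `-l` options come from the thread. Note that `--force-ocr` rasterizes every page before running OCR, so the output may be larger than the input and any vector content is lost.

```python
import shlex
import subprocess
from typing import List


def build_force_ocr_command(input_pdf: str, output_pdf: str, languages: List[str]) -> List[str]:
    """Build an argv list for ocrmypdf using --force-ocr (hypothetical helper)."""
    return [
        "ocrmypdf",
        "--force-ocr",              # rasterize each page and OCR it from scratch
        "-l", "+".join(languages),  # e.g. ["eng", "kat"] becomes "eng+kat"
        input_pdf,
        output_pdf,
    ]


def run_force_ocr(input_pdf: str, output_pdf: str, languages: List[str]) -> None:
    """Print the command, then run it; assumes ocrmypdf is installed and on PATH."""
    cmd = build_force_ocr_command(input_pdf, output_pdf, languages)
    print(shlex.join(cmd))
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


if __name__ == "__main__":
    # Input filename taken from the report; the output name is an assumption.
    cmd = build_force_ocr_command(
        "Basic Georgian lessons 6–7.pdf", "l67-forceocr.pdf", ["eng", "kat"]
    )
    print(shlex.join(cmd))  # show the command without running it
```

Passing the arguments as a list avoids the shell escaping needed for the spaces in the input filename. Call `run_force_ocr` instead of printing once the command looks right.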

same here, using --redo-ocr and -l eng+chi_sim, result is completely unreadable

Dec 28 '21 08:12 SYQsb
