OCRmyPDF
Under `--redo-ocr`, a fragment of original OCR text was made visible in output
Describe the bug
Using OCRmyPDF to redo OCR on a scanned document, some parts of the original erroneous OCR text became visible in the output pdf. (Most of the original OCR text was stripped, as expected.)
To Reproduce
The command was run with:
ocrmypdf --redo-ocr -l eng+kat Basic\ Georgian\ lessons\ 6–7.pdf l67-redoocr.pdf
I can attach full verbose log output if desired; I was unable to pick out anything useful in the log.
Example file
Full input pdf can be found here. It’s GPG encrypted with the maintainer’s key, and shouldn’t be publicly redistributed.
The unexpected output occurs in the last paragraphs of the 1st and 3rd pages.
Expected behavior
The original input pdf was a scanned document produced by Adobe Scan for iPhone. It had a mostly-erroneous OCR layer, since most of the document is in Georgian (i.e. in the Georgian script, ქართული ენა), but the Adobe OCR had (it seems) attempted to read it all as English text.
Running `ocrmypdf --redo-ocr -l eng+kat`, my expectation was to remove the original OCR layer and re-OCR in English and Georgian, without altering anything visible in the document.
This was successful on most of the document, but in just a few paragraphs, the original (erroneous) OCR text was not stripped, but became visible in the output pdf.
Screenshots
A section of the input and output where this occurs:
(from bottom of 3rd page)
System
- OS: macOS Catalina, 10.15.7
- OCRmyPDF Version: 11.6.3
- OCRmyPDF installed with the Homebrew package manager, `brew install ocrmypdf`.
+1 for providing a test file, interesting problem, and introducing me to an exotic script that looks like a cross between Greek and Arabic.
The problem here, which is a bug or perhaps a missing feature, is that ocrmypdf can't detect this particular type of OCR when performing `--redo-ocr`, so it doesn't remove the pre-existing OCR correctly. Unfortunately I don't know when I'll be able to improve this, because there are several types of OCR we'll need to identify and remove.
You should get much better results with `--force-ocr`.
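For anyone scripting this workaround, here is a minimal sketch of swapping `--redo-ocr` for `--force-ocr` from Python. The helper names `build_force_ocr_command` and `force_ocr` and the output filename are hypothetical, chosen only for illustration; the `ocrmypdf` flags themselves are the ones discussed in this thread. It assumes `ocrmypdf` and the relevant Tesseract language packs are installed and on `PATH`.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_force_ocr_command(src: str, dst: str, languages: Sequence[str]) -> List[str]:
    """Build an ocrmypdf command line that uses --force-ocr instead of --redo-ocr.

    Hypothetical helper for illustration; not part of OCRmyPDF itself.
    """
    if not languages:
        raise ValueError("at least one Tesseract language code is required")
    # Tesseract language codes are joined with '+', e.g. eng+kat.
    return ["ocrmypdf", "--force-ocr", "-l", "+".join(languages), src, dst]


def force_ocr(src: str, dst: str, languages: Sequence[str]) -> None:
    """Run the command and raise CalledProcessError if ocrmypdf exits non-zero."""
    cmd = build_force_ocr_command(src, dst, languages)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)


# Example call (output filename is an assumption):
# force_ocr("Basic Georgian lessons 6–7.pdf", "l67-forceocr.pdf", ["eng", "kat"])
```

Note that `--force-ocr` rasterizes every page before running OCR, so any vector content is converted to an image and the output file may be larger than with `--redo-ocr`. For a scanned document like the one in this report, that trade-off is usually acceptable.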
Same here: using `--redo-ocr` and `-l eng+chi_sim`, the result is completely unreadable.