OCRmyPDF redo-ocr introduces spaces in words

Open navtis opened this issue 3 years ago • 10 comments

Context:

I am processing a set of PDFs which have previously been (poorly) OCR-ed. After processing with OCRmyPDF I am using pdftotext to extract and examine the text (the problem I am seeing is not introduced by pdftotext as I get the same text by copying and pasting from the PDF in evince).

Symptoms:

If I use --force-ocr the text output is reasonable, though with occasional additional carriage returns and words run together. Unfortunately the new PDF is much greater in size than the original.

If I use --redo-ocr the output PDF is slightly smaller than the original, and there are fewer additional carriage returns and run together words, but the output text has many words with internal spaces, to the extent that it is unreadable.

What I would expect

No difference between the ocr output with the two options

Example:

1. force-ocr `It will be seen that the North 1s makingan imperative call upon him and that he answers with whole-hearted eagerness, for the time giving himself up almost entirely to the

delight of this new interest, coming face to face with the

Northern literature that until now he had but known in

translations and abstracts; his mind

is in a ferment with

all this fresh material urging him to fresh production; and

“The Earthly Paradise” work goes on steadily, while the business claims his close personal attention.`

2. redo-ocr (the same stretch of text): ``It w ill b e se e n th a t th e N o r th is m a k in g a n im p era tiv e call u p o n h im an d th at h e a n sw ers w ith w h o le -h e a r te d ea ^^C^D n e ss, fo r th e tim e g iv in g h im s e lf u p a lm o st e n tir e ly to th e d e lig h t o f th is n ew in te r e st, c o m in g face to face w ith th e N o r th e r n litera tu re th at u n til n o w h e had b u t k n o w n in tra n sla tio n s and a b str a cts; his m in d is in a fe r m e n t w ith all th is fresh m aterial u r g in g h im to fresh p r o d u c tio n ; and “ T h e E a r th ly P a r a d ise ” w o r k g o e s o n ste a d ily , w h ile th e b u sin e ss cla im s h is c lo s e p erson al a tte n tio n

OCRmyPDF with neither parameter set, run on PDFs from the same source (books in the same edition of the same series scanned at the same time with the same equipment) which have not been OCR-ed, produces text similar in quality to the output using --force-ocr, but without so many joined together words.
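To flag affected files in a batch without reading each one, a rough heuristic is the share of single-letter tokens in the extracted text. The sketch below is only an illustration: the function names and the 0.3 threshold are assumptions, not part of OCRmyPDF or pdftotext, and the threshold would need tuning on real data.

```python
def single_letter_ratio(text):
    """Return the fraction of whitespace-separated tokens that are one letter long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for t in tokens if len(t) == 1 and t.isalpha())
    return singles / len(tokens)


def looks_fragmented(text, threshold=0.3):
    """Guess whether the text has spaces inserted inside words.

    The threshold is an assumed starting point, not a measured value.
    """
    return single_letter_ratio(text) > threshold


# Snippets taken from the two outputs quoted above.
force_ocr_sample = "It will be seen that the North is making an imperative call upon him"
redo_ocr_sample = "It w ill b e se e n th a t th e N o r th is m a k in g"
```

Applied to the text that pdftotext extracts from each output file, `looks_fragmented` would separate the two samples above; it will also misfire on text that legitimately contains many single letters, such as initials or formulas.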

I can see two possibilities: finding a way to get redo-ocr to produce better output, or finding a way to remove the existing embedded text and then run OCRmyPDF with neither parameter. Unfortunately I haven't found a way to remove pre-existing ocr embedded text (I am on Linux and only have access to free software right now).

System

OS: Linux 5.4.80-gentoo-r1
Python: 3.8.7
OCRmyPDF: 11.6.2
tesseract 5.0.0-alpha-152-g17c8a
leptonica-1.74.4
libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.3) : libpng 1.6.37+apng : libtiff 4.2.0 : zlib 1.2.11

Installation

Using portage and ebuild from https://gentoo.zugaina.org/Search?search=ocrmypdf

Feb 20 '21 00:02 navtis

Ooh, interesting. Thanks for doing this investigation.

Can you share the PDF? This is very likely going to have to do with details of how the previous PDF is formatted.

If you are concerned about sharing the file publicly, you can encrypt it with my public key as described here: https://github.com/jbarlow83/OCRmyPDF/wiki

Feb 20 '21 00:02 jbarlow83

This is one of a set scanned by the Russian National Library in 2009 and ocr-ed by them some time after (actually the original ocr-ed text in this one isn't too bad, but it is the one I used for the extract in my original report). It is public domain (distributed by the library itself), though there is nothing in it to say so.

The_collected_works_of_William_Morris_vol05.pdf

Feb 20 '21 10:02 navtis

This is not a fix, but a workaround: the original OCR can be removed by using pdfimages [note: clarified below] to break the PDF into page images; new OCR is then created by piping the output of img2pdf into ocrmypdf with no parameters. The quality of the OCR output in this case is very slightly better than that of force-ocr (in the extract above, all the same clumping of words but fewer unwanted carriage returns, and the '1s' is correctly read as 'is').
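A minimal sketch of that workaround, assuming poppler's pdfimages, img2pdf and ocrmypdf are all on the PATH. The helper names and the `page` prefix are hypothetical. It only makes sense for PDFs that consist of exactly one full-page image per page; anything else (extra embedded images, masks, transparency) will not be reassembled correctly.

```python
import glob
import subprocess


def extract_images_cmd(src_pdf, prefix):
    # pdfimages (poppler): -png writes each embedded image as <prefix>-NNN.png
    return ["pdfimages", "-png", src_pdf, prefix]


def img2pdf_cmd(image_paths):
    # Without -o, img2pdf writes the resulting PDF to standard output.
    return ["img2pdf", *image_paths]


def ocr_from_stdin_cmd(out_pdf):
    # "-" as the input file tells ocrmypdf to read the PDF from standard input.
    return ["ocrmypdf", "-", out_pdf]


def reocr_via_images(src_pdf, out_pdf, prefix="page"):
    """Discard the old text layer by rebuilding the PDF from its page images."""
    subprocess.run(extract_images_cmd(src_pdf, prefix), check=True)
    images = sorted(glob.glob(prefix + "-*.png"))
    producer = subprocess.Popen(img2pdf_cmd(images), stdout=subprocess.PIPE)
    try:
        # Pipe img2pdf's output straight into ocrmypdf.
        subprocess.run(ocr_from_stdin_cmd(out_pdf), stdin=producer.stdout, check=True)
    finally:
        producer.stdout.close()
        producer.wait()
```

Calling `reocr_via_images("in.pdf", "out.pdf")` would run the three steps in order; checking the page count and a few pages of the result by eye is advisable, since pdfimages extracts every embedded image, not one per page.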

Feb 21 '21 18:02 navtis

Re the workaround: ~~There are multiple programs called "pdftoimages" based on my search.~~ You do get better quality by extracting images from PDFs and applying OCR to those, but that works if and only if the PDF is what we call "a bag of images PDF": one image per page drawn exactly over the whole page with no funny business. This approach will fail if the images are used in any complex way, such as with masking, stenciling, transparency, fancy color tables or any of about 1000 other tricky things a PDF could do - so ocrmypdf rasterizes whole page images using Ghostscript and then OCRs the result. That way, the PDF actually gets interpreted similar to how a PDF viewer would interpret it. Any minor variance in the output quality from two PDF rasterizations is just random noise.

For the file itself, OCRmyPDF cannot detect the particular way OCR is embedded in this file. The result is that it covers up the vast majority of the text with --redo-ocr, thinking that the OCR is visible text.

Here is what it sends to Tesseract with --redo-ocr for page 6, the one you quoted: 000001_ocr

The additional spacing you found seems to be Ghostscript's fault (via --output-type pdfa). For reasons unclear to me, Ghostscript adds these extra spaces in PDF/A conversion. Extra spaces are a hard problem in PDF generation, owing to PDF's heritage as a printer output language (spaces don't exist, so you have to heuristically infer them, and PDF viewers follow standards worse than web browsers did when IE 6 was king). There's also no official way to insert "OCR text"; instead there are several conventions.

The bottom line is that for this file you will get the best results with --force-ocr --output-type pdf; time permitting, I will need to look into options to detect more ways of inserting OCR.
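For scripting that recommendation over many files, the invocation can be wrapped as below. This is a sketch, not part of OCRmyPDF: the helper names are hypothetical, and only the two flags named above are used.

```python
import subprocess


def force_ocr_cmd(src_pdf, out_pdf):
    # --force-ocr rasterizes every page and replaces any existing text layer;
    # --output-type pdf skips the Ghostscript PDF/A conversion step.
    return ["ocrmypdf", "--force-ocr", "--output-type", "pdf", src_pdf, out_pdf]


def run_force_ocr(src_pdf, out_pdf):
    """Run ocrmypdf and raise CalledProcessError if it exits non-zero."""
    subprocess.run(force_ocr_cmd(src_pdf, out_pdf), check=True)
```

As noted earlier in the thread, --force-ocr tends to produce a considerably larger file than the original.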

Feb 22 '21 07:02 jbarlow83

Thank you for the time you have spent looking at this. I have put in a bug report to ghostscript:

https://bugs.ghostscript.com/show_bug.cgi?id=703591

Re my 'workaround': I used 'pdfimages' from poppler, not 'pdftoimages' as I mistakenly wrote. I was too focussed on looking at the OCR output, and had not noticed that in fact there are a few embedded images in the original PDF, which were of course not reassembled into the PDF correctly when the images were piped back into ocrmypdf.

Feb 22 '21 11:02 navtis

Following investigation by the Ghostscript developer, it looks like the root cause is probably that the fonts originally used in generating the ocr-ed file are not available on my system, and the difference in width of the substitute fonts causes the problem. It is odd that extracting the text from the PDF/A using poppler's pdftotext shows spaces between almost all letters; doing it with Acroread reconstitutes much of the text, and ghostscript's own 'txtwrite' option manages to completely recover the text, with no surplus spaces. But Linux viewers such as evince fail to do this, making search unusable, so for my purposes - since I don't have access to many fonts - I'll have to give up on using PDF/A. 'output-type pdf' from now on!
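The font-width explanation can be illustrated with a toy model. This is not how pdftotext, evince or Ghostscript actually work; the `Glyph` class, the `gap_ratio` rule and all the numbers are assumptions chosen for illustration. The extractor sees only glyphs at positions and inserts a space whenever the gap after a glyph is large relative to the width it believes that glyph has.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge where the glyph is drawn
    width: float  # width the extractor believes the glyph has


def layout(text, advance, believed_width):
    """Place one glyph per character at a fixed advance; spaces move the pen but draw nothing."""
    glyphs = []
    x = 0.0
    for ch in text:
        if ch != " ":
            glyphs.append(Glyph(ch, x, believed_width))
        x += advance
    return glyphs


def infer_text(glyphs, gap_ratio=0.3):
    """Rebuild text, inserting a space when the gap exceeds gap_ratio times the previous width."""
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Same glyph positions, two different beliefs about glyph width.
matching_font = infer_text(layout("the north", advance=10.0, believed_width=10.0))
narrow_substitute = infer_text(layout("the north", advance=10.0, believed_width=5.0))
```

With the matching width the model recovers `the north`; with a substitute font half as wide, every letter appears to be followed by a gap and the result is `t h e n o r t h`. Different extractors use different heuristics, which would be consistent with txtwrite, Acroread and pdftotext disagreeing on the same file.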

Feb 23 '21 20:02 navtis

I have this same thing happen all the time when I redo the OCR on a PDF by removing the text layer and using ocrmypdf --force-ocr on the file, then overlaying just the text layer back onto the original file (I use --output-type pdf not pdfa), or even when I use tesseract to OCR some image files and put them into a PDF.

What's interesting is that while I usually use Adobe Reader since Preview.app doesn't always play nice with PDF files, the two apps can highlight very differently (and Preview does a better job). (For a while there Preview was blowing up the size of some PDFs, but that might have stopped.)

For example, Reader will highlight a sentence, leaving out the spaces between the words, while Preview will highlight everything. The gaps in the Reader highlight get rendered as line breaks if I copy the text, while the Preview highlight has spaces. This screen cap has a Preview highlight up top, then a Reader one.

What's most odd to me is that Reader will copy the text of a Preview highlight with the spaces, even though its own highlighting would yield the gaps with new lines.

A third reader I use commonly is PDF Viewer on my iPad. It's worse than either of the other two about leaving spaces between words.

I've been assuming that's something inherent in the readers' software, as you mention above, but perhaps there's something ocrmypdf can do differently?

PS I find --redo-ocr doesn't always work well. On this file, I get visible OCR. https://www.jstor.org/stable/pdf/30103314.pdf

Mar 20 '21 19:03 Jmuccigr

Incorrect spacing between letters/words is a longstanding problem in Tesseract and I've worked on it over there. It has to do with PDF being a print production file format, with no concept of a word in the core format. There are just glyphs drawn at locations, and determining which ones are words is a matter of interpretation. (There are ways of adding markup to fix this in part, but many PDF viewers don't support it.)

Preview is generally the least cooperative PDF viewer. There is an improvement PR in Tesseract that makes this better on all viewers except Preview, where it introduces major regressions.

Jun 08 '21 20:06 jbarlow83

> Incorrect spacing between letters/words is a longstanding problem in Tesseract and I've worked on it over there. It has to do with PDF being a print production file format, with no concept of a word in the core format. There are just glyphs drawn at locations, and determining which ones are words is a matter of interpretation. (There are ways of adding markup to fix this in part, but many PDF viewers don't support it.)
>
> Preview is generally the least cooperative PDF viewer. There is an improvement PR in Tesseract that makes this better on all viewers except Preview, where it introduces major regressions.

Yeah, I think I remember you talking about inserting non-breaking spaces to force the word breaks to work correctly.

I should add that PDF Viewer on iOS works much better now and gives more satisfactory results. I'm still using Adobe Reader even though its interface is crap.

Jun 08 '21 22:06 Jmuccigr

No news on this? I installed paperless and it's making all of my PDFs useless because of additional spaces between nearly all letters. At first I thought it was a problem with paperless, then with ocrmypdf, but finally with tesseract? So this is a major problem with no solution at all, making all products using ocrmypdf unusable.

Mar 05 '22 08:03 tpre
