Text layer not aligned with original document

Open wpzdm opened this issue 6 years ago • 8 comments

Describe the issue: When selecting a word, the last character is always left out of the selection. The problem should be clear from the following snapshot: https://imgur.com/f12zGYc This problem appears on all files I tried.

Example file https://gofile.io/?c=ptsFtL

  • [x] This is the input file
  • [x] The file contains no personal or confidential information
  • [ ] I am the copyright holder for this file
  • [ ] I permit this file to be included in the OCRmyPDF test suite under the CC-BY-SA 4.0 license
  • [ ] I am not the copyright holder, but this file is available under a free software license

System:

  • OS: macOS 10.14.5
  • OCRmyPDF Version: 9.0.3

wpzdm avatar Nov 04 '19 03:11 wpzdm

The imgur link didn't work unfortunately.

Please upload a PDF. It's very unlikely I will be able to investigate this issue without an example input PDF.

jbarlow83 avatar Nov 04 '19 05:11 jbarlow83

I uploaded an input PDF. You can use --force-ocr to override the original text layer. Note that the problem is unrelated to the --force-ocr option.

BTW, maybe I did not make it clear enough: the problem is in the display of the selection area, and the snapshot was meant to show it (not to be used as an input file). Here, I selected the word 'quantum', but as you can see from the snapshot, the character 'm' is not highlighted (although I can still copy the whole word with this selection).

wpzdm avatar Nov 04 '19 10:11 wpzdm

The alternative renderer --pdf-renderer hocr does a better job here. Note that it has issues with non-Latin text.
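For anyone who wants to try this from a script, here is a minimal sketch that assembles the command line with the two flags mentioned in this thread (`--force-ocr` and `--pdf-renderer hocr`). The helper names `build_ocrmypdf_command` and `run_ocr` and the file names are assumptions made for illustration; they are not part of OCRmyPDF.

```python
import shlex
import subprocess
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    renderer: Optional[str] = "hocr",
    force_ocr: bool = True,
) -> List[str]:
    """Assemble an ocrmypdf command line (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if force_ocr:
        # Replace any existing text layer, as suggested earlier in the thread.
        cmd.append("--force-ocr")
    if renderer is not None:
        # Select the alternative renderer discussed above.
        cmd += ["--pdf-renderer", renderer]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocr(input_pdf: str, output_pdf: str) -> None:
    """Run the command; requires the ocrmypdf executable on PATH."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_ocr() to actually execute it.
    print(shlex.join(build_ocrmypdf_command("input.pdf", "output.pdf")))
```

Passing `renderer=None` leaves the renderer choice to OCRmyPDF's default, which makes it easy to produce both outputs and compare the selection highlighting side by side.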

I believe this is a regression in Tesseract 4.1.0.

Related past issue: https://github.com/tesseract-ocr/tesseract/issues/1900

jbarlow83 avatar Nov 04 '19 10:11 jbarlow83

Thank you! hocr did well on this issue, but it seems worse than the default renderer at detecting and separating words, particularly in the table of contents.

Here is an example: https://gofile.io/?c=ZE8RQG. Compare page 13 of the PDF file: hocr detected none of the chapter titles and did not separate some words, such as 'Maximal inequality' and 'Kolmogorov Existence Theorem'. BTW, both renderers failed to separate all the occurrences of the phrase 'Section summary'.

Maybe I should open another issue?

wpzdm avatar Nov 05 '19 08:11 wpzdm

Inserting spaces between words is the job of the PDF viewer, due to unfortunate design decisions in the early days of PDF. macOS Preview also does a particularly poor job of this.
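To illustrate why results differ between viewers, here is a rough sketch of the kind of heuristic a viewer might apply: it looks at the horizontal gap between consecutive glyphs and inserts a space when the gap is large relative to the average glyph width. The function name `join_glyphs`, the `(character, left, right)` input format, and the 0.3 threshold are all assumptions chosen for illustration; this is not the algorithm of Preview or of any particular viewer.

```python
from typing import List, Tuple

# (character, left x, right x) in arbitrary page units; a hypothetical format.
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild a line of text, guessing where the word breaks are."""
    if not glyphs:
        return ""
    mean_width = sum(right - left for _, left, right in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for (_, _, prev_right), (char, left, _) in zip(glyphs, glyphs[1:]):
        # A gap wider than gap_ratio * mean glyph width is treated as a space.
        if left - prev_right > gap_ratio * mean_width:
            out.append(" ")
        out.append(char)
    return "".join(out)


if __name__ == "__main__":
    line = [("a", 0, 10), ("b", 11, 21), ("c", 30, 40), ("d", 41, 51)]
    print(join_glyphs(line))
```

With a heuristic like this, tightly set text or a different threshold can make two words run together, which is consistent with one viewer separating a phrase and another not.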

jbarlow83 avatar Nov 05 '19 10:11 jbarlow83

Thank you. I use Clearview on Mac. I don't know whether it shares the same underlying engine as Preview.

wpzdm avatar Nov 06 '19 04:11 wpzdm

too old

jbarlow83 avatar Nov 21 '23 08:11 jbarlow83

There's a fork of Tesseract that fixes the text alignment problem here.

Tarek-Hasan avatar Apr 09 '24 03:04 Tarek-Hasan