Better OCR Overlay

Open pschopen opened this issue 11 months ago • 2 comments

NAPS2 is one of the fastest and most convenient tools I know for optimising and OCRing PDFs. However, the OCRed text layer often doesn't line up with the text in the underlying image.

The zotero-ocr tool (https://github.com/UB-Mannheim/zotero-ocr), on the other hand, handles this quite well. Maybe it makes sense to use the same approach? I don't know the technical details; it's just an idea.

Feb 03 '25 16:02 pschopen

Do you have an example PDF where the text doesn't match up?

Mar 29 '25 02:03 cyanfish

It seems to happen when the document uses different fonts. In the following example, the body text is positioned correctly, but the heading is not.

OCR Overlay.pdf

Edit: I've just noticed that the body text doesn't overlap perfectly either. There is always extra space between the words.

Apr 03 '25 08:04 pschopen