pdfannots Scan with OCR: words not split

Open dirksierd opened this issue 4 years ago • 1 comment

With this PDF file, the words are not split. It's an OCR scan. I tried modifying word_margin in LAParams, to no avail. Exporting the highlights with PDF Expert (my macOS PDF reader) works fine, though: here's the expected output.
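
For context, here is a minimal sketch of how a word-margin rule of this kind decides where to put spaces. It is loosely modeled on the gap test pdfminer applies to horizontal text lines, but the `Glyph` class and `join_glyphs` function are hypothetical names made up for illustration, not pdfminer's API. It also shows one possible way tuning can fail: if glyph boxes touch or overlap (an assumption, not a diagnosis of this particular file), no non-negative margin produces a space.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for one extracted character and its box."""
    text: str
    x0: float      # left edge
    x1: float      # right edge
    height: float  # box height


def join_glyphs(glyphs, word_margin=0.1):
    """Join glyphs left to right, inserting a space whenever the gap to
    the previous glyph exceeds word_margin times the glyph's size."""
    out = []
    prev_x1 = None
    for g in sorted(glyphs, key=lambda g: g.x0):
        if prev_x1 is not None:
            margin = word_margin * max(g.x1 - g.x0, g.height)
            if prev_x1 < g.x0 - margin:
                out.append(" ")
        out.append(g.text)
        prev_x1 = g.x1
    return "".join(out)


# A visible gap before "c" yields a space.
spaced = [Glyph("a", 0, 5, 10), Glyph("b", 5.2, 10, 10), Glyph("c", 13, 18, 10)]
print(join_glyphs(spaced, word_margin=0.1))

# Boxes that touch never yield a space, even with word_margin=0.
tight = [Glyph("a", 0, 5, 10), Glyph("b", 5, 10, 10)]
print(join_glyphs(tight, word_margin=0.0))
```

In this simplified model, lowering the margin only helps when there is some measurable gap between glyph boxes to begin with.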

Any thoughts?

Best regards

Apr 13 '20 17:04 dirksierd

This is an issue in the pdfminer library. I confirmed that:

- pdfminer's pdf2txt.py tool fails in a similar way -- no spaces and far too many chars extracted
- Both my PDF reader and Poppler's pdftotext utility extract the text correctly
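
The two symptoms (no spaces, too many characters) can be checked roughly once the text from each tool has been saved as a string. The sketch below is only an illustration: `space_ratio` and `compare_extractions` are hypothetical helpers, not part of pdfminer, Poppler, or pdfannots, and the sample strings are made up.

```python
def space_ratio(text: str) -> float:
    """Fraction of characters that are whitespace (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def compare_extractions(candidate: str, reference: str) -> dict:
    """Compare a suspect extraction against a known-good one."""
    return {
        "candidate_space_ratio": space_ratio(candidate),
        "reference_space_ratio": space_ratio(reference),
        # Values well above 1 suggest duplicated or spurious characters.
        "length_ratio": len(candidate) / len(reference) if reference else float("inf"),
    }


# Made-up strings standing in for the output of two extraction tools.
report = compare_extractions("thequickthequick", "the quick")
print(report)
```

A candidate space ratio near zero together with a length ratio well above 1 would match the failure described here.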

If you do report an issue (or find an existing one) on the pdfminer project, please link it here.

Mar 04 '21 21:03 0xabu
