paperless-ng
[BUG] pdfminer defaults cause excessive whitespace in extracted text
I ran into the same problem as #1679 when processing PDFs that had already been OCR'ed with Abbyocr: the extracted text has spaces between individual letters.
The issue in my case was pdfminer's default laparams, especially word_margin's default of 0.1:
>>> from pdfminer.high_level import extract_text as pdfminer_extract_text
>>> pdfminer_extract_text("0000131.pdf")
'e S T A D T W E R K E\n\nxx\n\nV e r t r a g s k o n t o - N r . :[...]
Setting word_margin=1 fixed it for me, but I'm not sure whether that value is universally good. (I tried various margin values; 1.0 was the smallest that worked well for my documents.)
>>> from pdfminer.layout import LAParams
>>> laparm = LAParams(word_margin=1)
>>> pdfminer_extract_text("0000131.pdf", laparams=laparm)
'e STADTWERKE\n\nxxx\n\nVertragskonto-Nr.:[..]'
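For anyone wondering why this one parameter makes such a difference: as far as I understand it, pdfminer inserts a space between two neighbouring characters on a line when the horizontal gap between them exceeds word_margin times the character size. The sketch below is a simplified stand-in for that heuristic, not pdfminer's actual code. The `join_chars` helper and the glyph coordinates are made up for illustration and are not taken from my PDF.

```python
from typing import List, Tuple

# (character, x0, x1, height): hypothetical glyph boxes, not from a real PDF
Glyph = Tuple[str, float, float, float]


def join_chars(glyphs: List[Glyph], word_margin: float) -> str:
    """Simplified model of the spacing heuristic: emit a space when the gap
    to the previous glyph exceeds word_margin * max(width, height)."""
    out = []
    prev_x1 = None
    for ch, x0, x1, height in glyphs:
        if prev_x1 is not None:
            margin = word_margin * max(x1 - x0, height)
            if prev_x1 < x0 - margin:
                out.append(" ")
        out.append(ch)
        prev_x1 = x1
    return "".join(out)


# Letter-spaced text: each glyph is 6 units wide and 10 high,
# with a gap of 4 units between consecutive letters.
spaced = [(c, i * 10.0, i * 10.0 + 6.0, 10.0) for i, c in enumerate("STADT")]

print(join_chars(spaced, word_margin=0.1))  # S T A D T
print(join_chars(spaced, word_margin=1.0))  # STADT
```

With a margin of 0.1 the 4-unit gap already counts as a word break, whereas with 1.0 only gaps wider than a full character do. That would also explain why a larger value could merge real words in documents with tight word spacing, so the right default probably depends on the input.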
Relevant information
- Host OS of the machine running paperless: Debian
- Browser: any
- Version: "jonaswinkler/paperless-ng@sha256:b61d514e178ddfa4673e72d0440b3166d46ec977dc6bbc7a9a293adf64200f55"
- Installation method: Docker