
[BUG] pdfminer defaults cause excessive whitespace in extracted text

Open tmbinc opened this issue 2 years ago • 0 comments

I ran into the same problem as #1679 when processing PDFs that had already been OCR'ed with Abbyocr: the extracted text has spaces between individual letters.

The issue in my case was pdfminer's default laparams, especially word_margin's default of 0.1:

>>> from pdfminer.high_level import extract_text as pdfminer_extract_text
>>> pdfminer_extract_text("0000131.pdf")
'e S T A D T W E R K E\n\nxx\n\nV e r t r a g s k o n t o - N r . :[...]
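To illustrate why a small word_margin can split letter-spaced text, here is a simplified model of the spacing decision as I understand it: a space is inserted when the gap to the previous glyph exceeds word_margin times the glyph's larger dimension. This is a sketch, not pdfminer's actual code; `Glyph`, `layout` and `join_glyphs` are hypothetical names, and the glyph sizes are made up.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for a positioned character on a text line."""
    char: str
    x0: float      # left edge
    x1: float      # right edge
    height: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def layout(text, width=5.0, gap=2.0, height=10.0):
    """Place characters left to right with a fixed gap (made-up metrics)."""
    glyphs, x = [], 0.0
    for ch in text:
        glyphs.append(Glyph(ch, x, x + width, height))
        x += width + gap
    return glyphs


def join_glyphs(glyphs, word_margin=0.1):
    """Simplified model: insert a space when the gap exceeds the margin."""
    out = []
    prev_x1 = None
    for g in glyphs:
        # Assumption: margin scales with the larger of glyph width and height.
        margin = word_margin * max(g.width, g.height)
        if prev_x1 is not None and prev_x1 < g.x0 - margin:
            out.append(" ")
        out.append(g.char)
        prev_x1 = g.x1
    return "".join(out)


glyphs = layout("STADT")
print(join_glyphs(glyphs, word_margin=0.1))  # S T A D T
print(join_glyphs(glyphs, word_margin=1.0))  # STADT
```

With these made-up metrics the gap between letters is 2 units; a margin of 0.1 × 10 = 1 unit is smaller than the gap, so every letter becomes its own word, while 1.0 × 10 = 10 units is not.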

Setting word_margin to 1 fixed it for me, but I'm not sure whether that value is universally good. (I've tried various values; 1.0 seems to be the smallest that worked well for me.)

>>> from pdfminer.layout import LAParams
>>> laparams = LAParams(word_margin=1)  # pdfminer's default is 0.1
>>> pdfminer_extract_text("0000131.pdf", laparams=laparams)
'e STADTWERKE\n\nxxx\n\nVertragskonto-Nr.:[..]'
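Since I don't know whether word_margin=1 is safe for every document, one possible approach would be to keep the defaults and only retry with the larger margin when the output looks letter-spaced. The sketch below is just an idea, not something paperless does: `looks_letter_spaced`, `extract_text_with_fallback` and the 0.6 threshold are hypothetical and untuned.

```python
def looks_letter_spaced(text, threshold=0.6, min_tokens=5):
    """Heuristic: True if most whitespace-separated tokens are single characters.

    The threshold and min_tokens values are guesses, not tuned on real data.
    """
    tokens = text.split()
    if len(tokens) < min_tokens:
        return False  # too little text to judge
    single = sum(1 for t in tokens if len(t) == 1)
    return single / len(tokens) >= threshold


def extract_text_with_fallback(path, fallback_word_margin=1.0):
    """Extract with pdfminer defaults; retry with a larger word_margin if needed."""
    # Imported lazily so the heuristic above can be used without pdfminer.six.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    text = extract_text(path)
    if looks_letter_spaced(text):
        laparams = LAParams(word_margin=fallback_word_margin)
        text = extract_text(path, laparams=laparams)
    return text


# The two outputs from this report, shortened at the elision:
bad = 'e S T A D T W E R K E\n\nxx\n\nV e r t r a g s k o n t o - N r . :'
good = 'e STADTWERKE\n\nxxx\n\nVertragskonto-Nr.:'
print(looks_letter_spaced(bad), looks_letter_spaced(good))
```

This would leave documents that already extract cleanly untouched, at the cost of a second extraction pass for the affected ones.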

Relevant information

  • Host OS of the machine running paperless: debian
  • Browser: any
  • Version: "jonaswinkler/paperless-ng@sha256:b61d514e178ddfa4673e72d0440b3166d46ec977dc6bbc7a9a293adf64200f55"
  • Installation method: docker

tmbinc, Oct 02 '22 17:10