pypdf Space regression by PR 1172

Open MartinThoma opened this issue 1 year ago • 1 comment

I've just noticed that PR #1172 introduced a regression in text extraction: many spaces that should have been kept are now removed.

Code + PDF

Just standard text extraction:

from PyPDF2 import PdfReader

reader = PdfReader("doc.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

PDFs:

https://arxiv.org/pdf/2201.00029.pdf - here it's very obvious
https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf (German doc) - here it mostly affects mathematical formulas, which lose the space separating them from the surrounding text. That is a pattern I've seen in many of the other documents as well.
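
To put a rough number on the lost spaces, the extracted text from before and after the PR could be compared with something like the sketch below. The helper names `space_ratio` and `lost_spaces` and the two sample strings are made up for illustration; they are not part of PyPDF2 and not real output from the PDFs above.

```python
def space_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are plain spaces."""
    if not text:
        return 0.0
    return text.count(" ") / len(text)


def lost_spaces(before: str, after: str) -> int:
    """Return how many fewer spaces `after` has than `before` (0 if none were lost)."""
    return max(0, before.count(" ") - after.count(" "))


# Hypothetical extraction results for the same line, before and after the PR.
old_text = "Let x be a point in the space X ."
new_text = "Letxbe a point in the spaceX."

print(f"space ratio: {space_ratio(old_text):.2f} -> {space_ratio(new_text):.2f}")
print(f"spaces lost: {lost_spaces(old_text, new_text)}")
```

In practice the two strings would come from running the extraction snippet above against the same PDF with the two library versions. A clear drop in the ratio on a page with many formulas would point at the pattern described here.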

See https://arxiv.org/pdf/2201.00029.pdf :

Sep 24 '22 04:09 MartinThoma

@pubpub-zz Would you mind having a look? It's not critical, but you are definitely the expert on that topic :-)

Sep 24 '22 04:09 MartinThoma
