pypdf
Space regression by PR 1172
I've just noticed that PR #1172 introduced a regression in text extraction: many spaces that should have been kept are now removed.
Code + PDF
Just standard text extraction:
from PyPDF2 import PdfReader

reader = PdfReader("doc.pdf")
text = ""
for page in reader.pages:
    # Append the extracted text of each page, separated by a newline
    text += page.extract_text() + "\n"
PDFs:
- https://arxiv.org/pdf/2201.00029.pdf - here it's very obvious
- https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf (German doc) - here it happens mostly around mathematical formulas, which lose the space separating them from the surrounding text. That is a pattern I've seen in many of the other documents as well.
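To make the regression easier to pin down, here is a rough sketch of how the extraction output from before and after PR #1172 could be compared. It assumes both texts have already been produced with the snippet above under the two versions. `space_stats` and `merged_tokens` are hypothetical helpers written for this issue, not part of PyPDF2, and the heuristic only catches the simple case where exactly two neighbouring words got glued together.

```python
from __future__ import annotations


def space_stats(text: str) -> dict:
    """Return simple whitespace statistics for an extracted text."""
    words = text.split()
    return {
        "chars": len(text),
        "spaces": text.count(" "),
        "words": len(words),
    }


def merged_tokens(before: str, after: str) -> list[str]:
    """Tokens in `after` that look like two adjacent `before` tokens glued together.

    Heuristic only (an assumption of this sketch): a token counts as merged
    if it equals two neighbouring words of `before` joined without a space
    and does not already occur as a word in `before`.
    """
    before_words = before.split()
    joined_pairs = {a + b for a, b in zip(before_words, before_words[1:])}
    known_words = set(before_words)
    return [
        token
        for token in after.split()
        if token in joined_pairs and token not in known_words
    ]


if __name__ == "__main__":
    # Toy strings standing in for the real extraction output of both versions.
    before = "the value x is small"
    after = "the valuex is small"
    print(space_stats(before))
    print(space_stats(after))
    print(merged_tokens(before, after))
```

A drop in `spaces` and `words` between the two versions, together with a long list from `merged_tokens`, would point at the places where spaces were lost.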
See https://arxiv.org/pdf/2201.00029.pdf :
@pubpub-zz Would you mind having a look? It's not critical, but you are definitely the expert on that topic :-)