
Missing spaces in extract_text() method

Open Sunguru opened this issue 1 year ago • 1 comments

extract_text() drops the spaces between words. See the attached PDFs. The text itself is extracted correctly, but in almost all fields it comes back with no spaces.

Environment

$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0

Code + PDF

PDF: 0004.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("0004.pdf")

page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)

gives:

 Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.

expected (copy-pasted from Google Chrome):

Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.
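
To make the difference concrete, here is a small sketch that compares the two outputs above once all whitespace is removed. It does not call pypdf; the two strings are pasted from this report, and `strip_whitespace` is a helper name made up for illustration. If the comparison holds, the characters are extracted correctly and only the spaces are lost.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character, leaving only the visible characters."""
    return re.sub(r"\s+", "", text)


# Output of page.extract_text() as reported above.
extracted = (
    " Reportingcrudeoilleak.\n"
    "Leakwasisolatedtowell\n"
    "pad.Segmentoflinewas\n"
    "immediatelyisolated,now\n"
    "estimatedat5barrelsofoil\n"
    "spilt.Rootcausestill\n"
    "unknownatthistime.\n"
)

# Text copy-pasted from the browser's PDF viewer.
expected = (
    "Reporting crude oil leak.\n"
    "Leak was isolated to well\n"
    "pad. Segment of line was\n"
    "immediately isolated, now\n"
    "estimated at 5 barrels of oil\n"
    "spilt. Root cause still\n"
    "unknown at this time.\n"
)

# True when both outputs contain the same characters in the same order.
same_characters = strip_whitespace(extracted) == strip_whitespace(expected)

# Number of space characters present in the expected text but not in the extraction.
missing_spaces = expected.count(" ") - extracted.count(" ")

print(f"same characters: {same_characters}, missing spaces: {missing_spaces}")
```

A check like this could also serve as a regression test once the PDF is added to the test suite, with `extracted` replaced by the real `extract_text()` result.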

0000.pdf

Yes, you may add these PDFs to the tests. They are public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx

P.S. Thank you for the great package!

Sunguru, Sep 06 '22 15:09