pypdf
Missing spaces in extract_text() method
extract_text() returns text with missing spaces. See attached PDFs. The text itself is extracted correctly, but in almost all fields it comes out with no spaces between words.
Environment
$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0
Code + PDF
PDF: 0004.pdf
from pypdf import PdfReader, __version__
print(f"pypdf=={__version__}")
reader = PdfReader("0004.pdf")
page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)
gives:
Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.
expected (copy-pasted from Google Chrome):
Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.
Yes, you may add the PDF to the tests. It is public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx
P.S. Thank you for the great package!