
New line character missing and URLs adding periods and space

Open AlexNguyen124 opened this issue 11 months ago • 1 comments

Two issues to report. I'm not sure whether these are bugs or features.

First, words at the end of a line are often concatenated with the words at the beginning of the next line. For example, I used pypdf on the following PDFs (the same occurs in other PDFs):
https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/corporate-sustainability/2021-cs-report.pdf
https://s2.q4cdn.com/470004039/files/doc_downloads/2022/08/2022_Apple_ESG_Report.pdf

In the first few lines of the output we see:

Citizens of the World
2021 Corporate Sustainability ReportCitizens of the World 2021
Corporate Sustainability Report 2Contents
INTRODUCTION  3
 —About our report 3
• Reporting framework 4
• Third-party assurance 4
 —Corporate sustainability at Air Canada 5

Immediately, there are a few inaccuracies:

  • 2nd line: "Report" and "Citizens" should be separated
  • 3rd line: "2" and "Contents" should be separated

The page we are trying to convert has many columns, and I suspect a newline character is missing.

Second, spaces are added to URLs. Consider what I found in the output: "www. aircanada. com/ citizensoftheworld"
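Until the extraction itself changes, the stray spaces can be stripped in a post-processing step. The following is a minimal sketch of that idea, not a pypdf feature: `rejoin_urls` and its pattern are hypothetical names, and the regex is a heuristic that only handles tokens starting with `www.` or `http(s)://` and a single space after each `.` or `/`.

```python
import re

# Hypothetical helper, not part of pypdf.
# Matches a URL-like token in which a single space may follow each "." or "/".
# After a "." only lowercase letters, digits and hyphens are accepted, so a
# sentence break such as "com. Then" is not swallowed into the URL.
_SPLIT_URL = re.compile(
    r"(?:https?://|www\.) ?[\w-]+(?:\. ?[a-z0-9-]+|/ ?[\w-]+)*"
)


def rejoin_urls(text: str) -> str:
    """Remove the spaces inside every URL-like token found in text."""
    return _SPLIT_URL.sub(lambda m: m.group(0).replace(" ", ""), text)


fixed = rejoin_urls("www. aircanada. com/ citizensoftheworld")
print(fixed)
```

Because this is a heuristic, it can still merge too much (for example when a URL is followed by a lowercase word after a period), so it is worth spot-checking the result on real output.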

I hope this helps.

Environment

Google Colab

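import os
from pypdf import PdfReader

# path_to_pdf, txt_path and fname are not defined in this snippet
doc = PdfReader(path_to_pdf)
text = ""
path_to_txt = os.path.join(txt_path, "pypdf", fname) + ".txt"
print(path_to_txt)
for page in doc.pages:
    text += page.extract_text()  # page text is appended with no separator
with open(path_to_txt, "w") as out:  # create the text output
    out.write(text)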
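One thing worth ruling out: the loop above appends each page's text with no separator, so the last word of one page is written directly against the first word of the next. The sketch below assumes, without confirmation from the report, that the joined words fall at page boundaries. `join_pages` is a hypothetical helper, and the sample strings are an assumed split of the output shown earlier, used only for illustration.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str], separator: str = "\n") -> str:
    """Join per-page text with an explicit separator between pages."""
    return separator.join(page_texts)


# Assumed page split of the reported output (illustrative only).
pages = [
    "Citizens of the World\n2021 Corporate Sustainability Report",
    "Citizens of the World 2021\nCorporate Sustainability Report 2",
    "Contents\nINTRODUCTION  3",
]

without_separator = "".join(pages)  # what `text += ...` effectively does
with_separator = join_pages(pages)

print(without_separator.splitlines()[1])
print(with_separator.splitlines()[1])
```

With pypdf this would be called as `join_pages(page.extract_text() for page in doc.pages)`. If words are still joined within a single page, the cause lies in the extraction itself rather than in the loop.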

AlexNguyen124, Jul 17 '23 23:07

All whitespace issues (https://github.com/py-pdf/pypdf/labels/whitespace) are notoriously hard to deal with. This might not get resolved any time soon (or at all).

MartinThoma, Jul 18 '23 10:07