Spaces (that do not exist in the original PDF) appear in the output of extract_text()

Open renanbirck opened this issue 6 months ago • 4 comments

I am trying to parse this PDF. However, the output of extract_text() contains a bunch of spaces that are not in the original PDF.

See the screenshot for what I mean, with the original PDF on the left alongside the output of extract_text() (e.g. "Av. Beir a Rio" should be "Av. Beira Rio", "Cen tro" should be "Centro"):

If I copy/paste from Okular or other PDF reader to a text document, it is copied correctly, so I know the PDF file is not broken.

Environment

I am using Python 3.12 in Fedora 39.

$ python -m platform
Linux-6.6.4-200.fc39.x86_64-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')
text = reader.pages[0].extract_text()

Dec 09 '23 22:12 renanbirck

This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

TL;DR: How a text layer is being retrieved depends on the actual library implementation - each tends to have its own advantages and limits. In this specific case, the pdftotext layout mode (based upon poppler, one of the standard PDF libraries for Linux systems) seems to provide "correct" results, as well as mutool convert.

Dec 10 '23 08:12 stefan6419846
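
For readers who want to try the pdftotext route from Python, here is a minimal sketch. It assumes the poppler `pdftotext` binary is installed and on `PATH`; the helper names `build_pdftotext_command` and `extract_with_pdftotext` are made up for illustration and are not part of pypdf or poppler.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build the argument list for poppler's pdftotext in layout mode."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert (1-based)
    # A trailing "-" asks pdftotext to write the text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_with_pdftotext(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `extract_with_pdftotext('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf', 1, 1)` would then return the text of the first page, provided the binary is available.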

> This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

I understand. Is there any way I can work around it in pypdf? Other PDF libraries (like pymupdf, based on mupdf) don't have that problem.

Dec 13 '23 15:12 renanbirck
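
Since other extractors reportedly handle this file correctly, their output can serve as a reference for locating the spurious spaces in the pypdf output. The following is a rough heuristic sketch, not a pypdf feature: `find_split_words` is a hypothetical helper that reports runs of adjacent tokens in the candidate text which, when joined, form a word that appears in the reference text. It can produce false positives when two legitimate adjacent words happen to concatenate into another reference word.

```python
def find_split_words(candidate, reference, max_parts=3):
    """Find reference words that appear broken by extra spaces in candidate.

    Returns a list of (broken_form, expected_word) tuples.
    """
    ref_words = set(reference.split())
    tokens = candidate.split()
    found = []
    i = 0
    while i < len(tokens):
        merged = False
        # Try the longest run first, down to a run of two tokens.
        for j in range(min(len(tokens), i + max_parts), i + 1, -1):
            joined = "".join(tokens[i:j])
            if joined in ref_words:
                found.append((" ".join(tokens[i:j]), joined))
                i = j  # skip past the tokens that were merged
                merged = True
                break
        if not merged:
            i += 1
    return found


pypdf_text = "Av. Beir a Rio Cen tro"      # as produced by extract_text()
reference_text = "Av. Beira Rio Centro"    # e.g. copied from another reader
print(find_split_words(pypdf_text, reference_text))
# → [('Beir a', 'Beira'), ('Cen tro', 'Centro')]
```

The reference string could come from any tool that extracts the file correctly, for example the text returned by pymupdf's `page.get_text()`.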

You might want to have a look at the code from https://github.com/py-pdf/pypdf/discussions/2038#discussioncomment-7736074.

Dec 13 '23 15:12 stefan6419846

@renanbirck the extra spaces are the output of the "tt" special character conversion. I don't know how to get the correct output: the translation is not part of the ToUnicode field. I also don't know how other programs are doing the translation.

Apr 02 '24 19:04 pubpub-zz
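
To check what a font's ToUnicode field actually covers, one can dump its mappings. The sketch below makes several assumptions: the page carries its own /Resources entry (not inherited from a parent node), the CMap uses plain `beginbfchar` sections with UTF-16BE targets, and `bfrange` sections are ignored. `parse_bfchar` and `tounicode_maps` are hypothetical helper names, not pypdf functions.

```python
import re


def parse_bfchar(cmap_text):
    """Collect code -> Unicode string mappings from beginbfchar sections.

    bfrange sections are not handled in this sketch.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, flags=re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Targets are assumed to be UTF-16BE encoded hex strings.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def tounicode_maps(pdf_path, page_index=0):
    """Return {font_name: {code: text}} for fonts that have /ToUnicode."""
    from pypdf import PdfReader  # imported lazily; only needed for this helper

    page = PdfReader(pdf_path).pages[page_index]
    resources = page["/Resources"]  # assumption: present directly on the page
    if "/Font" not in resources:
        return {}
    fonts = resources["/Font"]
    result = {}
    for name in fonts:
        font = fonts[name]  # indexing resolves indirect references
        if "/ToUnicode" in font:
            data = font["/ToUnicode"].get_data()
            result[name] = parse_bfchar(data.decode("latin-1"))
    return result
```

Character codes used on the page that are missing from these tables have no ToUnicode translation, which is consistent with the observation above that the translation has to come from somewhere else.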
