pypdf

extract_text produces hexadecimal output

Open staff0rd opened this issue 1 year ago • 2 comments

The code below produces output that looks like a string of hexadecimal values. The first page of the PDF is shown below. Note that I can copy and paste text from it normally (via Google Chrome).

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.4, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
pdfreader = PdfReader('kia-stonic-owners-manual-my23.pdf')
# read text from pdf
raw_text = ''
for i, page in enumerate(pdfreader.pages):
    content = page.extract_text()
    if content:
        raw_text += content

# write text to file
with open('text.txt', 'w') as f:
    f.write(raw_text)

Share here the PDF file(s) that cause the issue: kia-stonic-owners-manual-my23.pdf

First page of pdf

[image: screenshot of the first page of the PDF]

top of text.txt

[image: screenshot of the top of text.txt, showing the hexadecimal-looking output]

staff0rd, Jan 16 '24 01:01

Did you find a workaround for this?

IshmamR, Mar 18 '24 18:03

The fonts in the PDF have no ToUnicode mapping, which is the standard way to translate character codes into Unicode text during extraction. Without that information, pypdf falls back to the raw character codes. Personally, I have not yet been able to identify a way to get Unicode from the font.
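
To check whether a given PDF is affected, one option is to look for a /ToUnicode entry in each font dictionary. The following is only a minimal sketch under stated assumptions: the helper names `fonts_missing_tounicode` and `report_fonts_without_tounicode` are hypothetical (they are not part of pypdf), and the page-level lookup ignores resources inherited from a parent /Pages node.

```python
def fonts_missing_tounicode(fonts):
    """Return the names of fonts whose dictionary has no /ToUnicode entry.

    `fonts` is a mapping of font name -> font dictionary, for example the
    /Font entry of a page's /Resources. Hypothetical helper, not a pypdf API.
    """
    missing = []
    for name, font in fonts.items():
        # Values may be indirect references in a real PDF; resolve them if so.
        if hasattr(font, "get_object"):
            font = font.get_object()
        if "/ToUnicode" not in font:
            missing.append(name)
    return missing


def report_fonts_without_tounicode(pdf_path):
    """Print, for each page, the fonts that lack a /ToUnicode entry."""
    # Imported lazily so the helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for number, page in enumerate(reader.pages, start=1):
        # Simplification: only /Resources stored directly on the page is
        # inspected, not resources inherited from a parent /Pages node.
        if "/Resources" not in page:
            continue
        resources = page["/Resources"]
        if "/Font" not in resources:
            continue
        missing = fonts_missing_tounicode(resources["/Font"])
        if missing:
            print(f"page {number}: no /ToUnicode for {', '.join(missing)}")


# Example call (needs pypdf and the PDF attached to this issue):
# report_fonts_without_tounicode("kia-stonic-owners-manual-my23.pdf")
```

If most fonts on a page are reported, extract_text has no standard mapping to rely on for that page, which would be consistent with the output shown above.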

pubpub-zz, Mar 18 '24 21:03

Without further feedback, I am closing this issue as beyond pypdf's capabilities.

pubpub-zz, Aug 14 '24 18:08