Wrong characters during extract_text with /Differences for font /TJQCZS+FzBookMaker2DlFont

Open zailushang2006 opened this issue 2 months ago • 2 comments

I need to extract text from a PDF document with page.extract_text(), but all of the extracted Chinese characters are garbled. I suspect the cause is that the document uses special Chinese fonts such as /TJQCZS+FzBookMaker2DlFont. Stepping through the pypdf source in a debugger, I found that the /Font -> /Encoding -> /Differences table maps character codes to non-standard glyph names:

{'/Differences': [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27', 40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C', 45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31'], '/Type': '/Encoding'}
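To make the table above easier to read, here is a minimal sketch of how a /Differences array expands into a code-to-glyph-name mapping: an integer starts a run, and each name that follows is assigned the next consecutive code. The helper names `expand_differences` and `is_code_derived_name` are made up for this illustration and are not part of pypdf. The sketch also checks an observation about this particular table: every glyph name is just "G" plus the hex value of its own code, so the names themselves carry no information about which character they represent.

```python
def expand_differences(differences):
    """Expand a PDF /Differences array into {character code: glyph name}.

    An integer sets the current code; each following name is assigned to
    the current code, which is then incremented.
    """
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("glyph name found before any starting code")
            mapping[code] = str(item)
            code += 1
    return mapping


def is_code_derived_name(code, name):
    """Hypothetical check: is the glyph name just '/G' + the code in hex?"""
    return name == f"/G{code:02X}"


# The array reported in this issue.
differences = [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27',
               40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C',
               45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31']

mapping = expand_differences(differences)
all_derived = all(is_code_derived_name(c, n) for c, n in mapping.items())
print(len(mapping), all_derived)
```

With pypdf, the same array should be reachable through plain dictionary access on the page, along the lines of page["/Resources"]["/Font"][font_key]["/Encoding"]["/Differences"], where font_key is whichever resource name the page assigns to the font.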

I decoded the embedded font file at /Font -> /FontDescriptor -> /FontFile3 using its specified /Filter, /FlateDecode, but the decoded font data also looks garbled.
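One note on the decoded font file: a /FontFile3 stream holds a binary font program (bare CFF or OpenType), so it will look like garbage when viewed as text even if decoding worked. The sketch below shows one way to check this by looking at the first bytes after inflating. The function name `sniff_font_program` is invented for this example, and the input is stand-in bytes, not the real font from this PDF.

```python
import zlib


def sniff_font_program(data: bytes) -> str:
    """Guess the container format of an embedded font from its first bytes."""
    if data[:4] == b"OTTO":
        return "OpenType (CFF outlines)"
    if data[:4] in (b"\x00\x01\x00\x00", b"true"):
        return "TrueType"
    # A bare CFF header is: major version 1, minor version 0, header size,
    # then an offset size between 1 and 4.
    if len(data) >= 4 and data[0] == 1 and data[1] == 0 and 1 <= data[3] <= 4:
        return "bare CFF (Type1C / CIDFontType0C)"
    return "unknown"


# Stand-in data only: a fake CFF header followed by padding, compressed the
# same way a /FlateDecode stream is (zlib/deflate).
fake_cff = bytes([1, 0, 4, 1]) + b"\x00" * 16
compressed = zlib.compress(fake_cff)

decoded = zlib.decompress(compressed)
print(sniff_font_program(decoded))
```

In pypdf, calling get_data() on the /FontFile3 stream object should return the already-decoded bytes, which could be passed to a check like this. A recognizable header would suggest the font file itself is intact and that the missing piece is the mapping from glyphs to Unicode.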

Since Adobe Acrobat can display the text correctly, there must be another way to handle this. I am not very familiar with the structure and protocols of PDF documents, so I am unsure how to resolve this issue.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.2'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

# Path to the attached PDF that triggers the issue
pdf_path = "GB+15322.2-2019.pdf"
reader = PdfReader(pdf_path)

number_of_pages = len(reader.pages)
print(f"Number of pages: {number_of_pages}")

# Only page index 3 (the fourth page) is inspected
page = reader.pages[3]
text = page.extract_text()
print(text[:5000])

PDF file that causes the issue: GB+15322.2-2019.pdf

Traceback

This is the complete traceback I see:

Page 3 (zero-based index):

[screenshot: 84971221-CBF2-46dc-B435-6ADF2271A1D4]

Print result:

[screenshot: 686E886A-E4B7-4bb5-9BAC-05A609334090]

zailushang2006, Apr 22 '24 08:04

The fact that Adobe can display the glyphs (images or drawings) does not mean it can associate them with characters. Copying and pasting the text from Acrobat Reader, pdf.js (Firefox), or PDFium (Chrome) does not produce usable results either. I strongly doubt there is an easy way to extract the data. The only approach I can think of is to render the pages to images and then run OCR on them to extract the text. This is outside pypdf's capabilities.

pubpub-zz, Apr 23 '24 21:04

From what I saw yesterday, pdftotext (Poppler) does produce reasonably valid results for page 4.

stefan6419846, Apr 24 '24 05:04
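For anyone who wants to try the pdftotext route from Python, a rough sketch follows. It assumes the Poppler command-line tools are installed and on PATH, and that "page 4" in the comment above is the 1-based number of the page at index 3 in the original script. The function names are made up for this example. The flags used (-f, -l, -layout, and "-" to write to stdout) are standard pdftotext options.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, page_index):
    """Build a pdftotext command line for one page (zero-based index in)."""
    page_number = page_index + 1  # pdftotext counts pages from 1
    return [
        "pdftotext",
        "-f", str(page_number),  # first page to convert
        "-l", str(page_number),  # last page to convert
        "-layout",               # keep the physical layout where possible
        pdf_path,
        "-",                     # write the text to stdout
    ]


def extract_page_with_pdftotext(pdf_path, page_index):
    """Return the page text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, page_index),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


cmd = build_pdftotext_cmd("GB+15322.2-2019.pdf", 3)
print(" ".join(cmd))
```

Calling extract_page_with_pdftotext("GB+15322.2-2019.pdf", 3) would then return the text for the same page the original script reads with pypdf, which makes the two outputs easy to compare side by side.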