pypdf
Wrong characters during extract_text with /Differences for font /TJQCZS+FzBookMaker2DlFont
I need to extract text from a PDF document using the page.extract_text function, but all the extracted Chinese characters are garbled. I suspect this is because the PDF uses special Chinese fonts such as /TJQCZS+FzBookMaker2DlFont. Stepping through the pypdf source in a debugger, I found that the /Font->/Encoding->/Differences table maps character codes to non-standard glyph names:
{'/Differences': [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27', 40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C', 45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31'], '/Type': '/Encoding'}
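To make the table above easier to read, here is a minimal sketch that expands a /Differences array into a code-to-glyph-name mapping. The helper names expand_differences and names_are_hex_codes are hypothetical and not part of pypdf. For the array shown above, every glyph name turns out to be "/G" followed by the character code in hexadecimal, so the names themselves appear to carry no information about which Unicode character each glyph represents.

```python
def expand_differences(differences):
    """Expand a /Differences array into {character code: glyph name}.

    An integer sets the running character code; each following name is
    assigned to the current code, which is then incremented.
    """
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("glyph name found before any character code")
            mapping[code] = str(item)
            code += 1
    return mapping


def names_are_hex_codes(mapping):
    """Return True if every glyph name is just '/G' + the code in uppercase hex."""
    return all(name == f"/G{code:02X}" for code, name in mapping.items())


# The /Differences array reported in this issue.
differences = [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27',
               40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C',
               45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31']

mapping = expand_differences(differences)
print(mapping[35], mapping[42], names_are_hex_codes(mapping))
```

If the glyph names were standard ones (for example names from the Adobe Glyph List), an extractor could translate them to Unicode; names that only repeat the code do not allow that.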
The font file under /Font->/FontDescriptor->/FontFile3 is decoded with the specified /Filter: /FlateDecode, but the decoded font file also looks garbled.
Since Adobe Acrobat can display the text correctly, there must be another way to handle this. I am not very familiar with the structure and protocols of PDF documents, so I am unsure how to resolve this issue.
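One way to narrow this down is to check what each font dictionary actually offers for text extraction. The sketch below summarises a font dictionary; describe_font is a hypothetical helper, and it assumes the font is passed as a dict-like object (pypdf's dictionary objects should behave this way, but treat that as an assumption). Note that /FontFile3 holds a binary font program, so it is expected to look unreadable when printed; what usually matters for extraction is whether the font has a /ToUnicode entry.

```python
from typing import Any, Mapping


def describe_font(font: Mapping[str, Any]) -> dict:
    """Summarise the parts of a font dictionary relevant to text extraction."""
    encoding = font["/Encoding"] if "/Encoding" in font else None
    # /Encoding may be a plain name (e.g. /WinAnsiEncoding) or a dictionary.
    has_differences = isinstance(encoding, Mapping) and "/Differences" in encoding
    descriptor = font["/FontDescriptor"] if "/FontDescriptor" in font else {}
    embedded = [key for key in ("/FontFile", "/FontFile2", "/FontFile3")
                if key in descriptor]
    return {
        "base_font": str(font["/BaseFont"]) if "/BaseFont" in font else None,
        "has_to_unicode": "/ToUnicode" in font,
        "has_differences": has_differences,
        "embedded_font_program": embedded,
    }


# Possible usage with pypdf (assumes the page carries its own /Resources
# entry rather than inheriting it from a parent node):
#
#     from pypdf import PdfReader
#     reader = PdfReader("GB+15322.2-2019.pdf")
#     fonts = reader.pages[3]["/Resources"]["/Font"]
#     for name in fonts:
#         print(name, describe_font(fonts[name]))
```

A font that reports has_to_unicode as False and only has /Differences entries with non-standard glyph names gives an extractor very little to work with.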
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Windows-10-10.0.19044-SP0
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.2'), PIL=10.2.0
Code + PDFex
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

pdf_path = "GB+15322.2-2019.pdf"  # the attached file
reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)
print(f"Number of pages: {number_of_pages}")
for i in range(number_of_pages):
    # Only page index 3 (the fourth page) shows the problem.
    if i != 3:
        continue
    page = reader.pages[i]
    text = page.extract_text()
    print(text[:5000])
Share here the PDF file(s) that cause the issue. GB+15322.2-2019.pdf
Traceback
This is the complete traceback I see:
page 3 (start 0):
print result:
The fact that Adobe Acrobat can display the glyphs (as images or drawings) does not mean it can associate them with characters. Copying and pasting from Acrobat Reader, pdf.js (Firefox) or PDFium (Chrome) does not give usable results either. I strongly doubt there is an easy way to extract the text. The only approach I can think of is to render the pages to images and run OCR on them, which is outside pypdf's capabilities.
From what I saw yesterday, pdftotext/poppler does seem to produce reasonably valid results for page 4.
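For anyone who wants to try the pdftotext route from Python, here is a minimal sketch. It assumes the pdftotext binary from poppler is installed and on the PATH; pdftotext_command and extract_page are hypothetical helper names. Page numbers for pdftotext are 1-based, so page 4 corresponds to reader.pages[3] in the pypdf script above.

```python
import subprocess


def pdftotext_command(pdf_path, page):
    """Build a pdftotext command that writes one 1-based page to stdout."""
    return ["pdftotext", "-f", str(page), "-l", str(page),
            "-enc", "UTF-8", pdf_path, "-"]


def extract_page(pdf_path, page):
    """Run pdftotext for a single page and return its output as text."""
    result = subprocess.run(pdftotext_command(pdf_path, page),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


# Possible usage (requires poppler and the attached file):
#
#     print(extract_page("GB+15322.2-2019.pdf", 4)[:5000])
```

Whether the output is actually correct for this document still has to be checked by eye against the rendered page.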