pypdf Implementation of advanced cmap encodings

Open stefan6419846 opened this issue 6 months ago • 9 comments

Currently, I am trying to extract text from PDF files, some of which trigger warnings like

/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_cmap.py:183: PdfReadWarning: Advanced encoding /GBK2K-H not implemented yet
  warnings.warn(
/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_cmap.py:183: PdfReadWarning: Advanced encoding /GBK2K-V not implemented yet
  warnings.warn(

I have seen this for both encodings mentioned above and for /StandardEncoding.

Digging through the available resources related to the GBK2K cmaps, I found some Adobe resources as well as the implementation from pdfminer.six, which ships some custom pickled files derived from the Adobe open source components to handle such cases.

Is there any guidance available on how to tackle this or how we would like to see this added to pypdf?

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.3, crypt_provider=('pycryptodome', '3.18.0'), PIL=10.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
page = reader.pages[0]
print(page.extract_text())

For now, I have no non-sensitive file I could share here. Looking at the example file, it seems to be a scan of a document (from a Canon device?) containing Latin characters, with wrongly configured or strange OCR that yields a mix of Latin and Chinese characters inside the text layer.

Traceback

warnings.warn as currently used only prints the pypdf code line where this occurred, so there is not much of a traceback.
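Since the warning output itself carries little context, one way to see which encodings a given file triggers is to record the warnings around the call. The following is only a rough sketch using the standard library's warnings module; collect_warnings is a hypothetical helper name, not part of pypdf.

```python
import warnings


def collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of warning messages it raised).

    Hypothetical helper for illustration; not part of pypdf.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including repeats from the same code line.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]


# Possible usage with the example from this issue (requires a PDF file):
#   text, messages = collect_warnings(page.extract_text)
#   print(messages)
```

Alternatively, running the script with python -W error turns warnings into exceptions, which gives a full traceback for the first warning raised.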

Dec 22 '23 10:12 stefan6419846

Is there any guidance available on how to tackle this or how we would like to see this added to pypdf?

No, there is none. I guess only @pubpub-zz can help you with that.

Dec 25 '23 11:12 MartinThoma

@stefan6419846 try to modify _cmap.py with

_predefined_cmap: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",  # TBC
    "/GB-EUC-V": "gbk",  # TBC
    "/GBpc-EUC-H": "gb2312",  # TBC
    "/GBpc-EUC-V": "gb2312",  # TBC
    "/GBK-EUC-H": "gbk",  # TBC
    "/GBK-EUC-V": "gbk",  # TBC
    "/GBK2K-H": "gb18030",  # <- new
    "/GBK2K-V": "gb18030",  # <- new
    # UCS2 in code
}
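As a quick sanity check before patching, it may help to confirm that the codec names in such a mapping are known to the local Python installation. The sketch below uses only the standard library; the proposed dict is a copy of the two new entries, and unknown_codecs and round_trip are hypothetical helper names. It only verifies that the codecs exist and round-trip text. It does not prove that a given CMap really corresponds to that codec, which is what the TBC comments are about.

```python
import codecs

# Copy of the two new entries suggested above, for illustration only.
proposed = {
    "/GBK2K-H": "gb18030",
    "/GBK2K-V": "gb18030",
}


def unknown_codecs(mapping):
    """Return the CMap names whose codec Python cannot look up."""
    missing = []
    for cmap_name, codec_name in mapping.items():
        try:
            codecs.lookup(codec_name)
        except LookupError:
            missing.append(cmap_name)
    return missing


def round_trip(text, codec_name):
    """Encode and decode text with the codec to check it is usable."""
    return text.encode(codec_name).decode(codec_name)


print(unknown_codecs(proposed))
print(round_trip("中文", "gb18030"))
```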

Dec 27 '23 13:12 pubpub-zz

@pubpub-zz Thanks for pointing this out. It does indeed seem to work.

When looking at this, two questions arose for me:

1. Why do we not declare the complete mapping already, if this seems easy enough to do? https://github.com/adobe-type-tools/cmap-resources lists quite a few more possible character maps.
2. Is there an easy way to generate corresponding test data? Since cmaps are rather essential, I would have assumed that there are some sample files, but after a quick search I could not really find any.

Dec 29 '23 10:12 stefan6419846
