Character and Unicode mapping is incorrect for CID fonts with embedded CMaps
In theory pdfminer.six has a CMapParser which is capable of parsing embedded CMaps defined in the Encoding field of a Type0 font specification.
In practice, it doesn't do that at all... it only parses ToUnicode CMaps: https://github.com/search?q=repo%3Apdfminer%2Fpdfminer.six%20CMapParser&type=code
This is a problem because some PDFs will actually define their own, more exotic mappings of byte strings to CIDs in Type0 fonts. So pdfminer.six is not able to get the right widths, etc, for characters in PDFs that use these because it cannot map them to any CIDs.
There is a more visible problem, which is that it is also unable to extract any text from them. This is because its handling of ToUnicode CMaps is actually entirely incorrect (and unfortunately PLAYA has inherited this, which I am in the process of fixing at the moment).
Specifically, pdfminer.six assumes that the mapping from a byte sequence in a content stream to a Unicode string goes like this:
b'ABC' => [cid(A), cid(B), cid(C)] => ["A", "B", "C"]
This is incorrect. Instead, ToUnicode is intended to map byte sequences directly to Unicode characters, so:
b'ABC' => ["A", "B", "C"]
The Encoding CMap (which could be an embedded one, as noted above) does a separate mapping of byte sequences to CIDs, which has nothing to do with text extraction. The pdfminer.six approach only happens to work most of the time because one of three things is true: there is no CMap, or the CMap is an identity CMap (so the input bytes and the CIDs are the same), or one of the predefined Unicode CMaps is used (see below).
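To make the difference concrete, here is a minimal sketch with toy data. The two dictionaries and both function names are invented for illustration; they are not pdfminer.six code, and the "chained" function is only a simplified model of the behaviour described above.

```python
# Toy data, invented for illustration; not taken from any real PDF.
# Encoding CMap: character code -> CID (used for glyph selection and widths).
encoding = {b"\x01": 40, b"\x02": 41}
# ToUnicode CMap: character code -> Unicode text (used for text extraction).
to_unicode = {b"\x01": "H", b"\x02": "i"}


def extract_chained(codes):
    """The incorrect model: code -> CID, then ToUnicode looked up by CID."""
    out = []
    for code in codes:
        cid = encoding[code]
        # Re-encode the CID as bytes and use it as the ToUnicode key.
        key = cid.to_bytes(len(code), "big")
        out.append(to_unicode.get(key))
    return out


def extract_direct(codes):
    """The model described above: ToUnicode is keyed by the character code."""
    return [to_unicode.get(code) for code in codes]


codes = [b"\x01", b"\x02"]
chained = extract_chained(codes)  # nothing found: CIDs 40 and 41 are not keys
direct = extract_direct(codes)    # both codes found
```

With an identity encoding (code 1 maps to CID 1, and so on) both functions would return the same result, which is why the chained model usually goes unnoticed.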
Here are some samples from pdf.js that illustrate the problem (pdfminer.six cannot extract text from them):
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue2931.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue7901.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue9534_reduced.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue18117.pdf
The pdf.js code is really quite clear for this.
- From an input byte string, first it reads variable-width character codes according to the ranges defined in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3454
- The CID (called `widthCode` here but it's the CID) is looked up in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3350
- The Unicode string representation is looked up in the `ToUnicode` map: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3363
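The three steps above can be sketched roughly as follows. This is not a port of the pdf.js code: `read_codes`, the codespace ranges, and both lookup tables are hypothetical, and the sketch assumes a well-formed CMap whose codespace ranges do not overlap.

```python
def read_codes(data, codespaces):
    """Split data into character codes using (low, high) codespace ranges.

    A candidate matches a range when it has the same length as the range
    bounds and every byte lies between the corresponding bounds.
    """
    codes = []
    pos = 0
    while pos < len(data):
        for low, high in codespaces:
            chunk = data[pos:pos + len(low)]
            if len(chunk) == len(low) and all(
                lo <= b <= hi for lo, b, hi in zip(low, chunk, high)
            ):
                break
        else:
            # No range matched: consume a single byte (a simplification).
            chunk = data[pos:pos + 1]
        codes.append(chunk)
        pos += len(chunk)
    return codes


# Hypothetical font data: one-byte and two-byte codespace ranges.
codespaces = [(b"\x00", b"\x80"), (b"\x81\x40", b"\x9f\xfc")]
code_to_cid = {b"A": 34, b"\x81\x40": 633, b"B": 35}            # Encoding CMap
code_to_unicode = {b"A": "A", b"\x81\x40": "\u3000", b"B": "B"}  # ToUnicode

codes = read_codes(b"A\x81\x40B", codespaces)
# Both lookups are keyed by the character code, independently of each other.
cids = [code_to_cid.get(c, 0) for c in codes]  # 0 as fallback is an assumption
text = [code_to_unicode.get(c) for c in codes]
```

Here the four input bytes become three character codes, and the CID list and the text list are built from the same codes without either depending on the other.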
And then some other stuff happens ;-) but the important point here is that Encoding and ToUnicode maps, while they both have the form of CMaps, are really totally separate and different things.
The source of the confusion here is that Adobe's "standards" are contradictory, see below ~due to the special case (represented by pdfminer/cmap/to-unicode-*) of Unicode conversion for predefined CMaps. This is indeed done by mapping the CID to a "Unicode value" (presumably a code point in UCS-2) using a special CMap.~
But this particular CMap is not a ToUnicode map, it is simply a special CMap whose CID values can be interpreted as Unicode code points. See PDF 1.7 section 9.10.2.
This logic is implemented in conformance with the PDF 1.7 specification in pdf.js here: https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L3796
The plot thickens here: if you read Adobe Technical Note #5411, which, to their credit, the authors of pdfminer.six clearly did, and which is referenced in the PDF 1.7 specification, then you would assume that ToUnicode maps are intended to apply to CIDs:
In order to derive content from PDFs that embed CIDFonts based on other character collections, a “ToUnicode” mapping file must be created, and properly installed for use with Distiller. This “ToUnicode” mapping file shall become part of the PDF, to ensure portability. This file, which follows CMap-style syntax, maps CIDs to Unicode UTF-16BE character codes. Because a “ToUnicode” mapping file is used to convert from CIDs (which begin at decimal 0, which is expressed as 0x0000 in hexadecimal notation) to Unicode code points, the following “codespacerange” definition, without exception, shall always be used: 1 begincodespacerange <0000> <FFFF> endcodespacerange
But this is entirely wrong! If you continue reading the PDF 1.7 standard, which the authors of pdf.js did (probably after encountering many curious PDFs), it goes on to say something totally different:
The CMap file shall contain begincodespacerange and endcodespacerange operators that are consistent with the encoding that the font uses. In particular, for a simple font, the codespace shall be one byte long. It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.
Note that it says character codes, which are not the same thing as CIDs and are obviously not always two bytes!
Cue Spiderman pointing at Spiderman image with both Spidermen labeled "Adobe"!
I don't really know why the PDF 1.4 standard added a reference to that utterly misleading technical note, but it should be ignored.
That said, the correct definition of ToUnicode is really a strict superset of the one in the technical note - basically you just have to actually respect the codespace ranges, and it covers both cases, and this is what pdf.js does.
Thanks for this. I understand this for 80% now. Would this be easy to fix in our codebase?
Yes, though the fix is quite invasive, as it implies:
- `PDFFont.decode` needs to return CIDs and Unicode together
- Code that calls `PDFFont.decode` needs to not assume that a CID can always be converted to Unicode
- Custom CMaps for CIDFonts need to be parsed and used to map byte sequences to CIDs
- Custom ToUnicode maps (for all fonts) need to be parsed and used to map byte sequences to Unicode
You can probably do this by simply taking cmapdb.py and font.py from PLAYA, as they are mostly compatible with pdfminer.six (notably, they still use the same precompiled cmapdb), though they could stand to be refactored a bit:
https://github.com/dhdaines/playa/blob/main/playa/font.py
https://github.com/dhdaines/playa/blob/main/playa/cmapdb.py
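For the first two points in the list above, the shape of the change might look something like this. `SketchFont` is a hypothetical stand-in, not pdfminer.six's `PDFFont` or PLAYA's font class, and it uses fixed-width codes only to keep the example short. The `(cid:N)` placeholder for unmapped characters is likewise just one possible choice.

```python
from typing import Dict, Iterator, Optional, Tuple


class SketchFont:
    """Hypothetical stand-in for a font class; not pdfminer.six's PDFFont."""

    def __init__(
        self,
        code_to_cid: Dict[bytes, int],
        code_to_unicode: Dict[bytes, str],
        code_width: int = 1,
    ) -> None:
        self.code_to_cid = code_to_cid
        self.code_to_unicode = code_to_unicode
        self.code_width = code_width  # fixed width is a simplification

    def decode(self, data: bytes) -> Iterator[Tuple[int, Optional[str]]]:
        """Yield (cid, text) pairs; text is None when no Unicode mapping exists."""
        for pos in range(0, len(data), self.code_width):
            code = data[pos:pos + self.code_width]
            yield self.code_to_cid.get(code, 0), self.code_to_unicode.get(code)


# Toy data: code 0x02 has a CID but no ToUnicode entry.
font = SketchFont({b"\x01": 40, b"\x02": 41}, {b"\x01": "H"})
pairs = list(font.decode(b"\x01\x02"))

# The caller decides what to do when there is no Unicode mapping.
text = "".join(t if t is not None else f"(cid:{cid})" for cid, t in pairs)
```

Returning the CID and the text together means the caller still has the CID for width lookups even when no text is available.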