"pertinent entries" in ToUnicode CMap stream dictionary
Section 9.10.3 of the PDF-2.0 spec states:
The only pertinent entry in the CMap stream dictionary (see "Table 118 — Additional entries in a CMap stream dictionary") is UseCMap, which may be used if the CMap is based on another ToUnicode CMap.
Table 118 lists the following entries as required: Type, CMapName, CIDSystemInfo. Does the above sentence mean that these entries are not required for ToUnicode CMaps? It would be great if the spec could clarify what the meaning of "only pertinent entry" is in this context.
In my opinion, yes: they are not required and normally do not make sense.
Some additional thoughts:
I agree that CMapName and CIDSystemInfo are not useful for ToUnicode CMaps.
Even if it turns out that the corresponding fields are not required in ToUnicode CMap stream dictionaries, probably Type should be required?
The only example of a ToUnicode CMap in the spec (Section 9.10.3, Example 2) does include the fields in question:
```
16 0 obj
<<
/Type /CMap
/CMapName /Adobe-Identity-UCS2
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS2) /Supplement 0 >>
/Length 433
>>
stream
...
endstream
```
(As mentioned in #344, I suspect that the CIDSystemInfo in the example may be wrong, though.)
Rewording as follows may help distinguish between required keys (which are always required!) and the use of "pertinent":
In addition to the required entries, the only pertinent entry in the CMap stream dictionary ...
This makes it clear that "pertinent" is not attempting to dismiss the required-ness of the other entries.
But none of the PDFs with a ToUnicode CMap that I just looked at contains any of these entries. Attached is a PASS file taken from the veraPDF test suite: 6-2-10-7-t01-pass-a.pdf. Or am I missing something?
Inspired by @DietrichSeggern's comment I checked the PDF files on my laptop: the files contain a total of 60477 ToUnicode CMaps. Here is how often each key in the stream dicts occurs:
- Length: 60477. Without this the stream cannot be read, so everybody has this.
- Filter: 60198. Nearly all CMap files are compressed.
- CIDSystemInfo: 30
- CMapName: 30
- Type: 30
- Length1: 5. There are strange PDF files out there.
So only 30 out of 60477 ToUnicode maps I inspected included the fields in question.
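A tally like the one above can be sketched in a few lines of Python. This is only an illustration of the counting step, not the script that produced the numbers above: it assumes the ToUnicode stream dictionaries have already been extracted by some PDF library and converted to plain Python dicts. The function name `tally_stream_dict_keys` and the sample input are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def tally_stream_dict_keys(stream_dicts: Iterable[Mapping[str, object]]) -> Counter:
    """Count in how many stream dictionaries each key occurs."""
    counts: Counter = Counter()
    for stream_dict in stream_dicts:
        # Each key occurs at most once per dictionary, so this adds 1 per key.
        counts.update(stream_dict.keys())
    return counts


# Toy input for illustration only; NOT data from the survey above.
sample = [
    {"Length": 433, "Filter": "FlateDecode"},
    {"Length": 120},
    {
        "Length": 433,
        "Type": "CMap",
        "CMapName": "Adobe-Identity-UCS2",
        "CIDSystemInfo": {"Registry": "Adobe", "Ordering": "UCS2", "Supplement": 0},
    },
]

counts = tally_stream_dict_keys(sample)
for key, n in counts.most_common():
    print(f"{key}: {n}")
```

Extracting the dictionaries is the library-specific part and is deliberately left out here.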
I was just following the bouncing ball of references... clearly not reflecting reality!
I guess the ToUnicode definition does say it is "A stream containing a CMap file..." and doesn't reference the CMap stream dictionary definition in Table 118, but it's hard to tell whether this is legacy language or an explicit, nuanced sentence. This is also what the 1st bullet near the end of 9.10.1 implies. The text generally confuses CMap (the data syntax) with CMap (the PDF stream object).
So maybe in this specific case "pertinent" does mean the only key that you can expect to find in a ToUnicode stream dictionary is UseCMap since it is not a "CMap stream" but a "stream that is a (slightly tweaked) CMap".
If that is true, then the consistent method to correct this would be to add a new Table titled "additional entries in a ToUnicode stream dictionary" and list just UseCMap. This is how all other streams in 32K are defined that have special keys beyond the standard set for streams. That way it would be explicitly unambiguous. But maybe the other CMap stream dictionary keys (like Type) are optional... I really don't know so let's also ask @lrosenthol to do some PDF archeology since extant data doesn't always get things correct.
CMapName and CIDSystemInfo are historical; they predate embedded CMaps (via UseCMap).
WMode is there for non-Roman (esp. CJK) fonts, but it is already optional.
Summarizing what I think has been discovered (please correct if I misunderstood!):
- many of the Table 118 entries formally defined as "Required" are not present in extant PDFs and thus are not required by implementations for ToUnicode CMap streams
- the problematic word "pertinent" is not a synonym for "required" since UseCMap is defined as "optional" in Table 118
Thus I think the correct solution is to add a new Table titled "additional entries in a ToUnicode CMap stream dictionary" and copy Table 118 but make everything optional. Then replace the problematic sentence with a simple reference to this new Table.
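To make the difference between the two readings concrete, here is a minimal sketch of a checker that reports missing stream dictionary keys under each one. It assumes the dictionary keys are available as a set of strings. The function name `missing_entries` and the `strict_table_118` flag are hypothetical and do not come from any validator. Length is treated as required in both cases because every stream needs it.

```python
# Required for every PDF stream, regardless of how Table 118 is read.
STREAM_REQUIRED = frozenset({"Length"})
# Entries that Table 118 marks as required for CMap streams.
TABLE_118_REQUIRED = frozenset({"Type", "CMapName", "CIDSystemInfo"})


def missing_entries(stream_dict_keys, strict_table_118: bool) -> list:
    """Return the sorted list of keys that are missing under the chosen reading.

    strict_table_118=True:  Table 118's required entries apply to ToUnicode CMaps.
    strict_table_118=False: the proposed new table, where those entries are optional.
    """
    required = set(STREAM_REQUIRED)
    if strict_table_118:
        required |= TABLE_118_REQUIRED
    return sorted(required - set(stream_dict_keys))


# A stream dictionary shaped like most of those found in the survey.
typical = {"Length", "Filter"}

strict_result = missing_entries(typical, strict_table_118=True)
relaxed_result = missing_entries(typical, strict_table_118=False)
print(strict_result)
print(relaxed_result)
```

Under the strict reading, a dictionary with only Length and Filter is flagged for three missing keys. Under the proposed table, it passes.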
PDF TWG agree
Some small oversights in the newly added table:
- The CIDSystemInfo entry in the new table has the copied over text "(However, it does not need to match the values of CIDSystemInfo for the Identity-H or Identity-V CMaps.)". I believe this bracket should be removed, since these CMaps cannot be used as ToUnicode CMaps (they map codes to CIDs and not to Unicode character sequences).
- WMode seems to make no sense for ToUnicode CMaps, and in particular "specifies the writing mode for any CIDFont with which this CMap is combined" makes no sense, since ToUnicode CMaps cannot be "combined" with fonts.
- The entry for UseCMap begins with the words "The name of a predefined CMap". I believe this part should be removed, since there are no predefined ToUnicode CMaps. (I am unsure whether a ToUnicode CMap contains "character mappings", as mentioned later in this entry.)
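As a small illustration of the "Unicode character sequences" point in the first bullet: the destination values in a ToUnicode CMap's bfchar and bfrange entries are hex strings encoding UTF-16BE, so one character code can map to several Unicode characters. The sketch below decodes such a destination string. The helper name `decode_tounicode_dest` is hypothetical, and the input values are made-up examples, not taken from any file discussed here.

```python
def decode_tounicode_dest(hex_string: str) -> str:
    """Decode a ToUnicode destination such as '<00660069>' as UTF-16BE text."""
    # Strip the angle brackets of the PDF hex string syntax and any inner whitespace.
    digits = "".join(hex_string.strip().strip("<>").split())
    return bytes.fromhex(digits).decode("utf-16-be")


single = decode_tounicode_dest("<0041>")        # one BMP character
ligature = decode_tounicode_dest("<00660069>")  # one code mapped to two characters
astral = decode_tounicode_dest("<D83DDE00>")    # surrogate pair for U+1F600
print(repr(single), repr(ligature), repr(astral))
```

A code-to-CID CMap such as Identity-H produces integers (CIDs) instead, which is why it cannot serve as a ToUnicode CMap.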