
"pertinent entries" in ToUnicode CMap stream dictionary

Open seehuhn opened this issue 1 year ago • 6 comments

Section 9.10.3 of the PDF-2.0 spec states

The only pertinent entry in the CMap stream dictionary (see "Table 118 — Additional entries in a CMap stream dictionary") is UseCMap, which may be used if the CMap is based on another ToUnicode CMap.

Table 118 lists the following entries as required: Type, CMapName, CIDSystemInfo. Does the above sentence mean that these entries are not required for ToUnicode CMaps? It would be great if the spec could clarify what the meaning of "only pertinent entry" is in this context.

seehuhn avatar Sep 04 '24 11:09 seehuhn

In my opinion yes: they are not required and normally do not make sense.

DietrichSeggern avatar Sep 04 '24 16:09 DietrichSeggern

Some additional thoughts:

I agree that CMapName and CIDSystemInfo are not useful for ToUnicode CMaps.

Even if it turns out that the corresponding fields are not required in ToUnicode CMap stream dictionaries, probably Type should be required?

The only example of a ToUnicode CMap in the spec (Section 9.10.3, Example 2) does include the fields in question:

16 0 obj
<<
/Type /CMap
/CMapName /Adobe-Identity-UCS2
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS2) /Supplement 0 >>
/Length 433
>>
stream
...
endstream

(As mentioned in #344, I suspect that the CIDSystemInfo in the example may be wrong, though.)

seehuhn avatar Sep 04 '24 17:09 seehuhn

Rewording as follows may help distinguish between required keys (which are always required!) and the use of "pertinent":

In addition to the required entries, the only pertinent entry in the CMap stream dictionary ...

This makes it clear that "pertinent" is not attempting to dismiss the required-ness of the other entries.

petervwyatt avatar Sep 06 '24 03:09 petervwyatt

But none of the PDFs with a ToUnicode CMap that I just looked at contain any of these entries. Attached is a PASS file taken from the veraPDF test suite: 6-2-10-7-t01-pass-a.pdf. Or am I missing something?

DietrichSeggern avatar Sep 06 '24 07:09 DietrichSeggern

Inspired by @DietrichSeggern's comment I checked the PDF files on my laptop: the files contain a total of 60477 ToUnicode CMaps. Here is how often each key in the stream dicts occurs:

  • Length: 60477. Without this the stream cannot be read, so everybody has this.
  • Filter: 60198. Nearly all CMap files are compressed.
  • CIDSystemInfo: 30
  • CMapName: 30
  • Type: 30
  • Length1: 5. There are strange PDF files out there.

So only 30 out of 60477 ToUnicode maps I inspected included the fields in question.
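A tally like the one above could be produced along these lines. This is only a minimal sketch: it assumes the ToUnicode stream dictionaries have already been extracted from the PDF files into plain Python dicts (the extraction step is not shown), and `tally_keys` and the `sample` data are hypothetical names and values used for illustration, not the script actually used for the survey.

```python
from collections import Counter


def tally_keys(stream_dicts):
    """Count how often each key occurs across a collection of stream dictionaries."""
    counts = Counter()
    for d in stream_dicts:
        counts.update(d.keys())
    return counts


# Hypothetical stand-in data: three ToUnicode stream dictionaries,
# only one of which carries the Table 118 entries.
sample = [
    {"Length": 433, "Filter": "FlateDecode"},
    {"Length": 512, "Filter": "FlateDecode"},
    {
        "Length": 433,
        "Type": "CMap",
        "CMapName": "Adobe-Identity-UCS2",
        "CIDSystemInfo": {"Registry": "Adobe", "Ordering": "UCS2", "Supplement": 0},
    },
]

counts = tally_keys(sample)
for key, n in counts.most_common():
    print(f"{key}: {n}")

# Share of surveyed ToUnicode CMaps that included the fields in question,
# using the figures reported above (30 out of 60477, roughly 0.05%).
share = 30 / 60477
print(f"{share:.4%}")
```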

seehuhn avatar Sep 06 '24 07:09 seehuhn

I was just following the bouncing ball of references... clearly not reflecting reality!

I guess the ToUnicode definition does say it is "A stream containing a CMap file..." and doesn't reference the CMap stream dictionary definition in Table 118, but it's hard to tell whether this is legacy language or an explicit, nuanced sentence. This is also what the 1st bullet near the end of 9.10.1 implies. The text generally confuses CMap (the data syntax) with CMap (the PDF stream object).

So maybe in this specific case "pertinent" does mean the only key that you can expect to find in a ToUnicode stream dictionary is UseCMap since it is not a "CMap stream" but a "stream that is a (slightly tweaked) CMap".

If that is true, then the consistent method to correct this would be to add a new Table titled "additional entries in a ToUnicode stream dictionary" and list just UseCMap. This is how all other streams in 32K are defined that have special keys beyond the standard set for streams. That way it would be explicitly unambiguous. But maybe the other CMap stream dictionary keys (like Type) are optional... I really don't know so let's also ask @lrosenthol to do some PDF archeology since extant data doesn't always get things correct.

petervwyatt avatar Sep 06 '24 08:09 petervwyatt

CMapName and CIDSystemInfo are historical and predate embedded CMaps (via UseCMap).

WMode is there for non-Roman (esp. CJK) fonts, but it is already optional.

lrosenthol avatar Nov 04 '24 08:11 lrosenthol

Summarizing what I think has been discovered (please correct if I misunderstood!)

  1. many of the Table 118 entries formally defined as "Required" are not present in extant PDFs and thus are not required by implementations for ToUnicode CMap streams
  2. the problematic word "pertinent" is not a synonym for "required" since UseCMap is defined as "optional" in Table 118

Thus I think the correct solution is to add a new Table titled "additional entries in a ToUnicode CMap stream dictionary" and copy Table 118 but make everything optional. Then replace the problematic sentence with a simple reference to this new Table.
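For implementers, the practical effect of this proposal could be sketched as follows. This is an illustration under stated assumptions, not spec text: the required key set is taken from Table 118 as quoted earlier in this thread, and `missing_entries` is a hypothetical helper name. The general stream requirement for Length is outside the scope of the sketch.

```python
# Keys that Table 118 lists as required, as quoted earlier in this thread.
TABLE_118_REQUIRED = {"Type", "CMapName", "CIDSystemInfo"}


def missing_entries(stream_dict, is_tounicode):
    """Return the Table 118 'required' keys that are absent from stream_dict.

    Assumption: under the proposed new table every entry is optional for
    ToUnicode CMap streams, so nothing is ever reported as missing for them.
    """
    if is_tounicode:
        return set()
    return TABLE_118_REQUIRED - set(stream_dict)


# A typical ToUnicode stream dictionary as found in the survey above.
typical = {"Length": 433, "Filter": "FlateDecode"}

print(missing_entries(typical, is_tounicode=True))
print(sorted(missing_entries(typical, is_tounicode=False)))
```

With a check of this shape, the overwhelming majority of ToUnicode CMaps found in extant files would pass, while the Table 118 requirements would still apply to ordinary CMap streams.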

petervwyatt avatar Nov 18 '24 22:11 petervwyatt

PDF TWG agree

petervwyatt avatar Jan 16 '25 21:01 petervwyatt

Some small oversights in the newly added table:

  1. The CIDSystemInfo entry in the new table has the copied over text "(However, it does not need to match the values of CIDSystemInfo for the Identity-H or Identity-V CMaps.)". I believe this bracket should be removed, since these CMaps cannot be used as ToUnicode CMaps (they map codes to CIDs and not to Unicode character sequences).
  2. WMode seems to make no sense for ToUnicode CMaps, and in particular "specifies the writing mode for any CIDFont with which this CMap is combined" makes no sense, since ToUnicode CMaps cannot be "combined" with fonts.
  3. The entry for UseCMap begins with the words "The name of a predefined CMap". I believe this part should be removed, since there are no predefined ToUnicode CMaps. (I am unsure whether a ToUnicode CMap contains "character mappings", as mentioned later in this entry.)

seehuhn avatar Jan 17 '25 13:01 seehuhn