
PDFA Clarify whether 6.2.11.3.3 CMaps applies to ToUnicode

Open myang-apryse opened this issue 1 month ago • 5 comments

Describe the bug

In PDF/A-2, 3, and 4, there's the line:

A CMap shall not reference any other CMap except those listed in ISO 32000-1:2008, 9.7.5.2, Table 118

I take this to mean that you can't UseCMap anything that's not predefined.

However, vera (pdf association's ref impl?) does not appear to care about UseCMap for ToUnicode, and I have seen examples of such usage, which I think is technically well defined, since ToUnicode CMaps are type 2 CMaps. For example: /Adobe-Korea1-UCS2 usecmap

My confusion is that the wording of the section appears to apply to the entire PDF file, anywhere the CMap format is used. However, the section sits under 6.2.11.3 Composite fonts, which, combined with vera's behaviour, suggests that its scope is conditional.

Possible resolutions:

  1. Strict and most general interpretation: the exact same restrictions apply to both ToUnicode and Encoding CMaps under this section. No wording change is needed, and this is a vera bug for not checking it.
  2. This section is intended to make sure Encodings do not break, and ToUnicode is not covered (the wording should be clarified or rephrased to make that clear).

Semi-relatedly, I'd like clarification on whether this section is specifically scoped to composite fonts (Type 0), so that CMaps used with other font types are not subject to this restriction. For example, you can/should have ToUnicode CMaps for any font in order to facilitate translation to Unicode; does the restriction not apply to Type 1 fonts?

Additional context

Misc note:

  • in PDFA-4, this section is 6.2.10.3.3 CMaps:

A CMap shall not reference any other CMap except those listed in ISO 32000-2:—, 9.7.5.2 Table 116.

myang-apryse, Oct 14 '25 18:10

vera (pdf association's ref impl?)

Not a reference implementation, no.

"usecmap" should be checked. We certainly check it. It's not applicable for ToUnicode cmaps only because there are no standard ToUnicode cmaps you could include - the only standard ones defined in the spec (eg Adobe-Japan-1) are "basefont" cmaps, not "cid" cmaps as used for ToUnicode.

So I think this is "no change required".

faceless2, Oct 14 '25 18:10

not applicable ... because there are no standard ToUnicode cmaps you could include

When you say not applicable, do you mean that UseCMap simply can't be used because there are no valid targets? Or do you mean the rule is not applicable, and you can UseCMap anything (as long as it's properly embedded)?

myang-apryse, Oct 14 '25 21:10

Well, "usecmap" references a CMap by name:

/Adobe-Korea1-UCS2 usecmap

So it would have to reference a CMap that is predefined. And there are no predefined ToUnicode cmaps.
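To make the name-based check concrete, here is a minimal Python sketch of how a validator might pull the target of a `usecmap` operator out of CMap stream data and compare it to the predefined names. It is not taken from veraPDF or any other tool in this thread. The function names are hypothetical, the regex is a naive scan rather than a real PostScript tokenizer, and the name set is a deliberately partial subset of Table 118.

```python
import re

# Partial, illustrative subset of the predefined CMap names. The complete
# list is Table 118 of ISO 32000-1:2008 and should be taken from the spec.
PREDEFINED_CMAPS_SUBSET = frozenset({
    "Identity-H", "Identity-V",
    "UniJIS-UCS2-H", "UniJIS-UCS2-V",
    "UniKS-UCS2-H", "UniKS-UCS2-V",
})

# Naive scan for "/Name usecmap". It does not skip PostScript comments or
# strings, so treat it as a sketch rather than a parser.
USECMAP_RE = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+usecmap\b")


def usecmap_targets(cmap_data: bytes) -> list[str]:
    """Return every CMap name referenced through the usecmap operator."""
    return [m.group(1).decode("latin-1") for m in USECMAP_RE.finditer(cmap_data)]


def non_predefined_targets(cmap_data: bytes,
                           predefined=PREDEFINED_CMAPS_SUBSET) -> list[str]:
    """Return the referenced names that are not in the predefined set."""
    return [name for name in usecmap_targets(cmap_data) if name not in predefined]
```

With the example from this thread, `non_predefined_targets(b"/Adobe-Korea1-UCS2 usecmap")` reports the name, because it is not among the predefined names in the subset above.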

faceless2, Oct 14 '25 21:10

"basefont" cmaps, not "cid" cmaps as used for ToUnicode.

I'm a bit confused by this statement. Is it reversed? The spec says:

The beginbfchar and endbfchar shall not appear in a CMap that is used as the Encoding entry of a Type 0 font; however, they may appear in the definition of a ToUnicode CMap.

Assuming the bf means what you refer to as basefont (as opposed to cid in begincidchar), then Encoding uses "cid cmaps", and ToUnicode uses "basefont cmaps".
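One way to see which family a given CMap belongs to is to look at the mapping operators it contains. The Python sketch below reports which of the `bf` and `cid` begin-operators occur in the stream data. The function name is hypothetical, and the scan is a plain regex that does not skip PostScript comments or strings.

```python
import re

# The four begin-operators that introduce bf and cid mapping sections.
MAPPING_OP_RE = re.compile(
    rb"\b(beginbfchar|beginbfrange|begincidchar|begincidrange)\b"
)


def mapping_operators(cmap_data: bytes) -> set[str]:
    """Return the names of the mapping begin-operators found in the data."""
    return {m.group(1).decode("ascii") for m in MAPPING_OP_RE.finditer(cmap_data)}
```

A ToUnicode CMap would typically yield `beginbfchar` and/or `beginbfrange`, while a CMap used as the Encoding of a Type 0 font would typically yield `begincidchar` and/or `begincidrange`.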

Separately, I don't believe Adobe-Japan-1 exists in the spec at all. Adobe-Japan1 (no dash) is a character collection, and it appears to have different components, including a ToUnicode map (Adobe-Japan1-UCS2) that's not in table 118.


In any case, exact string references aside, if I'm understanding you correctly, you're saying that the rule stands as written and you check table 118 for everything (including ToUnicode). Since the table only contains Encoding CMaps, ToUnicode implicitly can't UseCMap at all.

myang-apryse, Oct 14 '25 21:10

Well, "usecmap" references a CMap by name:

I'm using this a bit liberally: there's the usecmap operator in the stream content, and there's the UseCMap entry in the stream dictionary. I'm not trying to consider the case where they're not aligned, so I'm referring to them interchangeably.

The UseCMap entry can be just a name indicating a predefined CMap (one that is in table 118), or a name indicating an implementation-dependent, hardcoded resource (one that's not in table 118), or an embedded CMap stream (the operator refers to it by name, but the entry holds the actual stream).

The point is, only the first one is legal PDF/A (if the section applies), but the other two work and render fine as PDFs if properly defined. This clarification is in case you were trying to say that there's no possible way for it to work other than by referring to a predefined CMap.
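The three cases can be sketched as a small classifier. This is a hypothetical Python model, not any validator's actual logic: a UseCMap entry is represented as a `str` when it is a name and as `bytes` when it is an embedded stream, and the predefined set is a partial, illustrative subset of table 118.

```python
from enum import Enum
from typing import Union

# Partial, illustrative subset of table 118 (ISO 32000-1:2008); take the
# complete list from the spec.
PREDEFINED_SUBSET = frozenset({
    "Identity-H", "Identity-V", "UniKS-UCS2-H", "UniKS-UCS2-V",
})


class UseCMapKind(Enum):
    PREDEFINED_NAME = "name of a predefined CMap"
    OTHER_NAME = "name of a non-predefined resource"
    EMBEDDED_STREAM = "embedded CMap stream"


def classify_usecmap(entry: Union[str, bytes],
                     predefined=PREDEFINED_SUBSET) -> UseCMapKind:
    """Classify a UseCMap entry (str = name, bytes = embedded stream)."""
    if isinstance(entry, bytes):
        return UseCMapKind.EMBEDDED_STREAM
    if entry in predefined:
        return UseCMapKind.PREDEFINED_NAME
    return UseCMapKind.OTHER_NAME


def allowed_under_strict_reading(kind: UseCMapKind) -> bool:
    """Under the strict reading of the rule, only predefined names pass."""
    return kind is UseCMapKind.PREDEFINED_NAME
```

Whether `allowed_under_strict_reading` should be applied to ToUnicode CMaps at all is exactly the open question in this issue.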

myang-apryse, Oct 15 '25 18:10