Is the .notdef glyph used for undefined codes?

Open seehuhn opened this issue 2 years ago • 17 comments

Section 9.6.5.2 (Encodings for Type 1 fonts) states:

If an encoding maps to a character name that does not exist in the Type 1 font program, the .notdef glyph shall be substituted.

What happens for a character code which is not assigned any value? For example, this occurs when the encoding is WinAnsiEncoding and the character code is octal 010 (a value not listed in table D2).

I assume that this should also show the .notdef glyph, but I could not find anywhere in the spec that actually says so. It would be nice if the spec were more explicit on this point.
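As a rough illustration of the two cases (a sketch, not something taken from the spec): the encoding excerpt and the font's glyph set below are hypothetical, and returning .notdef for an unmapped code is the assumption being asked about, not a quoted rule.

```python
from typing import Optional

# Tiny, hypothetical excerpt of WinAnsiEncoding (character code -> glyph name).
# Octal 010 is deliberately absent, as in the question above.
WIN_ANSI_EXCERPT = {
    0o040: "space",
    0o101: "A",
    0o102: "B",
}

# Hypothetical set of glyph names present in a Type 1 font program.
FONT_GLYPHS = {".notdef", "space", "A"}


def glyph_name_for_code(code: int) -> Optional[str]:
    """Return the glyph name the encoding assigns to `code`, or None."""
    return WIN_ANSI_EXCERPT.get(code)


def glyph_to_paint(code: int) -> str:
    """Pick the glyph to paint for a single-byte character code."""
    name = glyph_name_for_code(code)
    if name is None:
        # Case asked about here: the code has no glyph name at all.
        # Falling back to .notdef is an ASSUMPTION in this sketch.
        return ".notdef"
    if name not in FONT_GLYPHS:
        # Case covered by 9.6.5.2: the name is not in the font program.
        return ".notdef"
    return name
```

Here `glyph_to_paint(0o102)` hits the case the spec covers (the name "B" is missing from the font), while `glyph_to_paint(0o010)` hits the unmapped-code case this issue is about.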

seehuhn avatar Dec 28 '23 20:12 seehuhn

What happens for a character code which is not assigned any value?

IMO that would be a buggy (i.e. invalid) PDF and the PDF processor behavior is implementation dependent.

I have to admit, though, that AFAIK this is not clearly said in the spec. I'd derive it from section 9.4.3 "Text-showing operators":

A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted. With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.5, "Character encoding".

If a character code shall then be looked up in the font’s encoding, it first of all must have a mapping in that encoding. Thus, a valid text showing operator string operand is built from codes that have a mapping in the encoding of the current font.

The obvious weak spot here is non-embedded fonts used with their built-in encoding: the same PDF becomes valid or invalid depending on which variety of the font file is available to the PDF processor.

mkl-public avatar Jan 08 '24 10:01 mkl-public

@mkl-public For comparison, in the case of composite fonts, section 9.7.6.3 (Handling undefined characters) makes it explicit that CID 0 is used both when a character code is invalid (i.e. is not contained in any codespace range) and when it does not map to a CID. Thus, if what you propose is the intended behaviour, this would be a difference between simple fonts and composite fonts.
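A simplified sketch of that composite-font fallback, following only the description in this comment: the function name, the codespace ranges, and the mapping are hypothetical, and the sketch ignores notdef ranges and the question of how many bytes an invalid code consumes.

```python
from typing import Dict, List, Tuple

CodespaceRange = Tuple[bytes, bytes]  # (low, high), same length in bytes


def cid_for_code(
    code: bytes,
    codespace_ranges: List[CodespaceRange],
    cid_map: Dict[bytes, int],
) -> int:
    """Return the CID for `code`, falling back to CID 0."""
    # A code is valid if it has the same length as some codespace range
    # and every byte lies within that range's bounds for its position.
    valid = any(
        len(code) == len(lo)
        and all(l <= c <= h for l, c, h in zip(lo, code, hi))
        for lo, hi in codespace_ranges
    )
    if not valid:
        return 0  # invalid code: not contained in any codespace range
    # Valid code with no CID assigned also falls back to CID 0.
    return cid_map.get(code, 0)


# Hypothetical CMap data for illustration only.
RANGES = [(b"\x00", b"\x80"), (b"\x81\x40", b"\x9f\xfc")]
CID_MAP = {b"\x41": 34}
```

With this data, `cid_for_code(b"\x41", RANGES, CID_MAP)` yields the mapped CID, while both an in-range but unmapped code such as `b"\x42"` and an out-of-range code such as `b"\xff"` yield CID 0.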

seehuhn avatar Jan 09 '24 11:01 seehuhn

Thus, if what you propose is the intended behaviour, this would be a difference between simple fonts and composite fonts.

Yes, you're right, I wasn't aware of the 9.7.6.3 rules.

Anyway, though, you'll always have differences here. Composite fonts always have the CID 0 glyph to turn to but Type 3 fonts don't have any default glyph...

(As I'm mostly into signed PDFs, I may be biased against any replacement glyphs which possibly change appearances...)

mkl-public avatar Jan 11 '24 16:01 mkl-public

@seehuhn Do you have an actual PDF that demonstrates the problem? Do we know what existing implementations do in this case?

Also, as @mkl-public asked: are we looking at this in the embedded case or the non-embedded case? I would assume the embedded case, but want to check...

lrosenthol avatar Feb 06 '24 00:02 lrosenthol

I'll try to produce a PDF ...

seehuhn avatar Feb 06 '24 12:02 seehuhn

Here is a test file. A quick experiment shows the following results (all on macOS):

  • Preview shows the .notdef glyph
  • Adobe Acrobat Reader shows the .notdef glyph
  • Google Chrome shows a blank space
  • Ghostscript shows the .notdef glyph

codes.pdf

seehuhn avatar Feb 06 '24 13:02 seehuhn

The text in question was added to the PDF 1.5 spec. I am investigating the history of why...

lrosenthol avatar Feb 06 '24 14:02 lrosenthol

also, as @mkl-public asked - are we looking at this in the embedded case or the non-embedded case? I would assume the embedded case, but want to check...

I believe the distinction between the embedded and non-embedded case should not matter here: there are no unmapped codes in built-in encodings, so the question can only arise if the encoding is specified relative to a base encoding different from the built-in one. Thus, the PDF reader will know whether the code maps to a glyph name without referring to information from the font program. It then needs to decide which glyph to try to show (if any) from the (embedded or non-embedded) font program ...
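To make that concrete, here is a minimal sketch of building the code-to-name table from a base encoding plus a Differences array (a code followed by the glyph names for that code and the codes after it). The function name and the sample data are hypothetical; the point is only that the table is known without opening the font program.

```python
from typing import Dict, List, Union


def apply_differences(
    base: Dict[int, str],
    differences: List[Union[int, str]],
) -> Dict[int, str]:
    """Overlay a Differences-style array on a base encoding."""
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            # An integer starts a new run of consecutive codes.
            code = item
        else:
            if code is None:
                raise ValueError("glyph name before any character code")
            encoding[code] = item
            code += 1
    return encoding


# Hypothetical base-encoding excerpt and Differences array.
BASE_EXCERPT = {65: "A", 66: "B"}
ENCODING = apply_differences(BASE_EXCERPT, [39, "quotesingle", 96, "grave", "a"])
```

After this, a check such as `8 in ENCODING` tells the reader whether octal 010 has a glyph name, independently of whether the font is embedded.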

seehuhn avatar Feb 06 '24 14:02 seehuhn

The difference between the embedded font case and the non-embedded font case is that for the same PDF the situation may look different on different machines (or even in different viewers on the same machine). If on one machine the glyph is there and on the other machine it is not (because there is only a cut-down version of the font on the server), I would not be happy with the PDF viewer replacing the missing glyph somehow without notifying the user of the replacement. After all, the idea of PDF from the beginning was to have a document displayed identically across platforms...

mkl-public avatar Feb 06 '24 15:02 mkl-public

The difference between the embedded font case and the non-embedded font case is that for the same PDF the situation may look differently on different machines (or even different viewers on the same machine). If on one machine the glyph is there and on the other machine it is not (because there only is a cut-down version of the font on the server), I would not be happy with the PDF viewer replacing the missing glyph somehow without notifying the user of the replacement. The idea of PDF after all from the beginning was to have a document to be displayed identically across platforms...

I agree that non-embedded fonts are problematic in general. But how is this worse/different for unmapped codes than for any other situation?

If, hypothetically, the spec said to show the .notdef glyph here, this glyph might of course look different on different viewers if the font is not embedded. But then, the glyph for A may also look different between viewers in this situation. And if, hypothetically again, the spec said that the reader can decide what to do for unmapped codes, then the result could look different between readers even for embedded fonts. Or, even more hypothetically, if different rules were added to the spec for embedded and non-embedded fonts, I would not quite see the motivation for this.

The good bit is that the widths for all codes are fixed in the font dictionary, so whatever glyph is shown, it should not mess up the layout of the text after the unmapped code.
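A small sketch of why the layout is unaffected, assuming a simple font dictionary with FirstChar, LastChar and Widths, and a MissingWidth fallback from the font descriptor; the function name and the numbers are made up for illustration.

```python
from typing import List


def advance_width(
    code: int,
    first_char: int,
    last_char: int,
    widths: List[int],
    missing_width: int = 0,
) -> int:
    """Look up the advance width for `code` from font-dictionary data.

    The result depends only on the character code, not on which glyph
    (the mapped one, .notdef, or nothing) ends up being painted.
    """
    if first_char <= code <= last_char:
        return widths[code - first_char]
    # Codes outside FirstChar..LastChar use the MissingWidth fallback.
    return missing_width


# Hypothetical font-dictionary values (widths in thousandths of a unit).
FIRST_CHAR, LAST_CHAR = 8, 10
WIDTHS = [500, 250, 0]
```

Here the unmapped code octal 010 (decimal 8) still advances by `WIDTHS[0]`, so the text after it lands in the same place whatever the viewer draws.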

seehuhn avatar Feb 06 '24 15:02 seehuhn

[...] If on one machine the glyph is there and on the other machine it is not (because there only is a cut-down version of the font on the server), [...]

Reading your reply again, I wonder whether you are thinking about codes which map to glyphs missing from the font, rather than about codes which map to no glyph name at all?

seehuhn avatar Feb 06 '24 15:02 seehuhn

I wonder whether you are thinking about codes which map to glyphs missing from the font, rather than about codes which map to no glyph name at all?

Actually I'm thinking of fonts used via their "built-in encoding".

mkl-public avatar Feb 06 '24 16:02 mkl-public

Actually I'm thinking of fonts used via their "built-in encoding".

I don't think a built-in encoding can have unmapped codes at all. Or am I confused?

seehuhn avatar Feb 06 '24 16:02 seehuhn

PDF TWG: there's no problem here, so no fix.

DuffJohnson avatar May 07 '24 05:05 DuffJohnson

@DuffJohnson What is the answer to my original question, i.e. should a PDF viewer show the .notdef glyph if it encounters a character code which is not assigned a value in a Type 1 font?

Is there no problem, because this is somewhere explained in the spec? Or is there no problem, because the spec intentionally leaves this unspecified?

seehuhn avatar May 07 '24 07:05 seehuhn

Yes, it should show .notdef.

The committee felt that this IS specified, even if implicitly, by the existing text.

DuffJohnson avatar May 07 '24 07:05 DuffJohnson

Thank you! That's what I will do in my code, and as far as I am concerned the issue is resolved then.

I'll leave this GitHub issue for somebody else to close, just in case people want to double-check that this is explicit enough in the spec. (I still can't find this information in the spec. Maybe the logic is that "not specifying a character for a code" is very similar to "mapping a code to a non-existent character"?)

seehuhn avatar May 07 '24 10:05 seehuhn