9.8.1 Table 120: FontFamily should be a text-string not byte-string
Describe the bug
Table 120 has FontFamily:
(Optional; PDF 1.5) A byte string specifying the preferred font family name. EXAMPLE 1 For the font Times Bold Italic, the FontFamily is Times
This is an optional, informative-only field, but it should certainly be a "text string", not a "byte string": it's clearly human-readable and could well be non-ASCII too.
Hmm, in the PDF Reference versions 1.5 and 1.6 that entry was a mere "string". In version 1.7 it became a "byte string".
I would assume there was a reason for that change. Possibly the name here was meant to be identical at byte-level to the corresponding entry in the 'name' table of the font.
PS: Ah, I just realized that the term "byte string" was only introduced in Reference 1.7. Thus, someone had to read the reference and decide for each string which exact type of string it was. So it is possible that there is no such well-thought-out reason for making the font family a byte string as I assumed above... ;)
This specific change was made somewhere between ISO 32000-1:2008 and the 2017 first edition of PDF 2.0.
The introduction of "byte strings" was done by Adobe (not ISO) in their version of the PDF 1.7 reference, prior to submission to ISO. See Table 3.32 in their edition:
I have vague memories of discussing this many years ago but will need to research all the comments submitted against ISO 32K over about 9 years. My guess is that it's because no intended or specific encoding of the data is defined, and FontFamily is not defined to be displayed anywhere, so any sequence of bytes is valid. Of course, a lot of water has also gone under the bridge since then...
@lrosenthol - do you have any PDF archaeological records as to this change?
Some more context. This information comes from the name table of an OpenType font, where it is normally a UTF-16BE String (https://learn.microsoft.com/en-us/typography/opentype/spec/name) however there are some legacy exceptions to that (see the last Note on that page).
We're obviously not embedding it as a raw array of UTF-16BE bytes in the PDF.
When creating a PDF and wanting to set this field, a PDF creator is going to receive it from whichever Font API they're using as their programming language's version of a "text string", because that's how the APIs present it. See e.g.:
- https://harfbuzz.github.io/harfbuzz-hb-ot-name.html,
- https://docs.oracle.com/en/java/javase/11/docs/api/java.desktop/java/awt/Font.html#getFamily()
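To make the creator-side path concrete, here is a minimal Python sketch. It assumes the family name comes from a Windows-platform name record stored as UTF-16BE without a BOM, and that the creator writes it as a PDF text string by prepending the FE FF marker. The function names are hypothetical and not taken from any font or PDF library.

```python
def name_record_to_str(raw: bytes) -> str:
    """Decode a name record, assuming UTF-16BE with no BOM (Windows platform)."""
    return raw.decode("utf-16-be")


def to_pdf_text_string(family: str) -> bytes:
    """Encode as a PDF text string: the FE FF marker followed by UTF-16BE."""
    return b"\xfe\xff" + family.encode("utf-16-be")


def to_pdf_hex_literal(data: bytes) -> str:
    """Render bytes as a PDF hexadecimal string literal."""
    return "<" + data.hex() + ">"


# Hypothetical raw record bytes for a family name, as a font API might hold them.
raw_record = bytes.fromhex("7d30660e9ad4")
family = name_record_to_str(raw_record)
print(to_pdf_hex_literal(to_pdf_text_string(family)))  # <feff7d30660e9ad4>
```

The point of the sketch is that the font API hands over a string, not bytes, so the creator has to pick an encoding anyway; a text string makes that choice explicit.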
When consuming a PDF this field is optional, but if it were used, the most likely context would be trying to match an unembedded font in the PDF with one installed on the OS. And again, this involves interacting with a Font API, which will expect the font family as a string.
But a key point I take away from the UTF-16BE definition at https://learn.microsoft.com/en-us/typography/opentype/spec/name is that they support full BCP-47 language codes whereas PDF only supports the 2-char codes, e.g. it even quotes "zh-Hant" in an example, which is illegal in a PDF Unicode string, hence the need for something more flexible. I also assume that the UTF-16BE BOM is not present in the OpenType strings, so again, if it's meant to byte-match, the encoding is not PDF UTF-16BE compatible...
The language code is a bit of a red herring. Yes, OpenType has very different language codes to PDF, and it has very different language codes to BCP-47 too - see https://learn.microsoft.com/en-us/typography/opentype/spec/languagetags
But they're not really applicable here. If this field is used for anything, it's used for matching a font on the OS, and there's no expectation that is done in a language-dependent way. If a font designer creates a font called "Foo" and decides its translation in French is actually "Arial", it shouldn't be chosen over normal "Arial" if the document (or OS) happens to be set to French. CSS, for example, doesn't do this, and font matching is done in CSS many, many orders of magnitude more often than it will ever be done in PDF.
Luckily, font family names aren't generally localised like this: "Times New Roman" is the same in French. The only time this is really going to come up is with non-Latin alphabets, and that's precisely why "byte string" is inappropriate.
/FontFamily <7d30660e9ad4>
If FontFamily is a byte string, what do I do with that? Turn it into an ISO 8859-1 string and try to match it to a font on the OS? As a byte string, this has no value.
/FontFamily <feff7d30660e9ad4>
If FontFamily is a text string, I know exactly what to do with this - that's 細明體, the Chinese name for MingLiU. I can pass that to whatever API I use to get my fonts to find it.
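The difference between the two examples above can be sketched in Python. This is a simplified decoder, not a complete implementation: it recognises the UTF-16BE marker (FE FF) and the UTF-8 marker (EF BB BF) allowed by PDF 2.0, and it falls back to Latin-1 as an approximation of PDFDocEncoding, which is an assumption made for brevity. It also ignores language escape sequences. The function name is hypothetical.

```python
UTF16BE_MARKER = b"\xfe\xff"
UTF8_MARKER = b"\xef\xbb\xbf"


def decode_pdf_text_string(data: bytes) -> str:
    """Decode a PDF text string by inspecting its leading marker bytes."""
    if data.startswith(UTF16BE_MARKER):
        return data[2:].decode("utf-16-be")
    if data.startswith(UTF8_MARKER):
        return data[3:].decode("utf-8")
    # Assumption: Latin-1 stands in for PDFDocEncoding, which differs in a few code points.
    return data.decode("latin-1")


with_marker = bytes.fromhex("feff7d30660e9ad4")
without_marker = bytes.fromhex("7d30660e9ad4")

print(decode_pdf_text_string(with_marker))           # 細明體
print(repr(decode_pdf_text_string(without_marker)))  # '}0f\x0e\x9aÔ'
```

With the marker the result is a name that can be handed straight to a font API; without it the same bytes decode to a string that no font API would match.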