for glyphs named uniXXXX, is makeotf following the documented GOADB behavior?

Open mbutterick opened this issue 4 years ago • 15 comments

According to the docs:

a) If the third field of the GOADB record for a glyph contains a Unicode value in the form uniXXXX or uXXXX[XX] (see note), assign that Unicode value to the glyph. Else b);

b) If a glyph name is in the Adobe Glyph List For New Fonts, use the assigned Unicode value. Else c);

c) If the glyph name is in the form uniXXXX or uXXXX[XX] (see note), assign the Unicode value. Else d);

d) Do not assign any Unicode value.

The use of “else” in the rules above suggests that if the conditions of that branch are met, then the behavior is triggered, and the heuristic descends no further.

In particular, if a glyph is named uniXXXX and has an explicit codepoint provided in the third column of its GOADB record, then the first branch of this rule should be triggered: the explicit codepoint should supersede the codepoint-inference rule in the third branch.
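As a rough sketch of how I read rules a) to d) (this is not makeotf's actual code), the documented order could be written as follows. `AGLFN` is a two-entry stand-in for the real Adobe Glyph List For New Fonts, and the function names are made up for illustration; the comma-separated form of the third field is assumed.

```python
import re

# uniXXXX (exactly 4 hex digits) or uXXXX[XX] (4 to 6 hex digits).
_UNI_NAME = re.compile(r"(?:uni([0-9A-F]{4})|u([0-9A-F]{4,6}))")

# Hypothetical stand-in for the Adobe Glyph List For New Fonts.
AGLFN = {"A": 0x0041, "space": 0x0020}


def parse_uni_name(name):
    """Return the code point encoded in a uniXXXX / uXXXX[XX] name, else None."""
    match = _UNI_NAME.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1) or match.group(2), 16)


def documented_codepoints(final_name, override_field=None):
    """Apply rules a) to d) in order, stopping at the first one that matches."""
    # a) an explicit third-column value wins (comma-separated list assumed)
    if override_field:
        values = [parse_uni_name(v) for v in override_field.split(",")]
        if all(v is not None for v in values):
            return values
    # b) the name is in the glyph list
    if final_name in AGLFN:
        return [AGLFN[final_name]]
    # c) the name itself encodes a code point
    implied = parse_uni_name(final_name)
    if implied is not None:
        return [implied]
    # d) no Unicode value
    return []


print(documented_codepoints("uni2032", "uE00D4") == [0xE00D4])  # True
```

Under this reading, a glyph named uni2032 only falls through to rule c) when its third field is empty.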

But this doesn’t seem to be what happens. Rather, if a glyph is named uniXXXX, it gets a codepoint of XXXX, even if the GOADB says otherwise.

Let’s have a test case! Attached is a PFB with one glyph called uni2032 and a GOADB asking that it receive PUA codepoint uE00D4:

 uni2032 uni2032 uE00D4

(Don’t forget the blank line at the end! makeotf will crash without it!)

We expect that the glyph named uni2032 should get the codepoint uE00D4. I generate the font like so:

makeotf -f unicode-test.pfb -gf GOADB

By the way, the docs also say:

If the -r or -ga options are NOT specified, the effect is to use the Unicode assignments from the third column of the GOADB without renaming the glyphs.

Yes! We want that codepoint from the third column.

But here’s what we see in the cmap of the generated OTF:

<?xml version="1.0" encoding="UTF-8"?>
<ttFont sfntVersion="OTTO" ttLibVersion="3.6">

  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x2032" name="uni2032"/><!-- PRIME -->
    </cmap_format_4>
    <cmap_format_6 platformID="1" platEncID="0" language="0">
    </cmap_format_6>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x2032" name="uni2032"/><!-- PRIME -->
    </cmap_format_4>
  </cmap>

</ttFont>

What we see is that our uE00D4 is nowhere to be found, and instead the glyph carries the never-asked-for u2032 codepoint.
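For anyone who wants to check this without reading the dump by eye, here is a small sketch that collects the code points per glyph from a TTX cmap dump, using only the standard library. The embedded snippet is a trimmed copy of the dump above, and the helper name is my own.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the TTX dump shown above.
TTX_SNIPPET = """<ttFont sfntVersion="OTTO" ttLibVersion="3.6">
  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x2032" name="uni2032"/>
    </cmap_format_4>
  </cmap>
</ttFont>"""


def codepoints_by_glyph(ttx_text):
    """Map each glyph name to the set of code points found in any cmap subtable."""
    root = ET.fromstring(ttx_text)
    result = {}
    for entry in root.iter("map"):
        code_point = int(entry.get("code"), 16)  # int() accepts the 0x prefix with base 16
        result.setdefault(entry.get("name"), set()).add(code_point)
    return result


mapping = codepoints_by_glyph(TTX_SNIPPET)
print(0xE00D4 in mapping["uni2032"])  # False: the requested override is absent
```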

unicode-test.zip

mbutterick avatar Dec 18 '20 22:12 mbutterick

(Don’t forget the blank line at the end! makeotf will crash without it!)

I think this was fixed a while ago.

Edit: confirmed fixed, the trailing newline is no longer needed.

frankrolf avatar Dec 18 '20 23:12 frankrolf

What would be the use case of overriding a deliberately named uXXXX glyph with a different code point?

frankrolf avatar Dec 18 '20 23:12 frankrolf

I have a tool that folds suffixed glyphs into the default positions of a font source. So I might have uni2032 and uni2032.alt. I want to make the alt the default glyph by taking the codepoints that would be assigned ordinarily to uni2032 and giving them to uni2032.alt. This works for any glyph not named uniXXXX, because of this behavior in makeotf.

In any case, I think the codepoint-resolution rule is correct: the third column should always supersede anything else. Otherwise one of the explicit promises of the GOADB file is violated, namely that it should “Override the default Unicode encoding by MakeOTF”.

mbutterick avatar Dec 18 '20 23:12 mbutterick

I did some testing and can confirm the bug. Summary: Unicode implied by the final glyph name (first column) currently cannot be overridden by the 3rd column.

Examples:

.notdef	.notdef
uni0020	uni0020
uni0041	uni0041	uni0058
uni0042	uni0042	uni0059
uni0043	uni0043	uni005A

→ writes code points for A B C (X Y Z expected)


.notdef	.notdef
uni0020	uni0020
uni0041	A	uni0058
uni0042	B	uni0059
uni0043	C	uni005A

→ writes code points for A B C (X Y Z expected)


.notdef	.notdef
uni0020	space
A	A	uni0058
B	B	uni0059
C	C	uni005A

→ writes code points for X Y Z


This problem is completely GlyphOrderAndAliasDB-based – the final glyph name does not show up anywhere else. While this behavior is confusing (and I agree it feels buggy), I suggest the following workarounds:

  • instead of swapping out glyphs via Unicode override, swap the names in the left column
  • do the Unicode swapping in the generated font binary via fontTools
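A minimal sketch of the second workaround, assuming fontTools is installed. The remapping logic works on plain `{codepoint: glyph_name}` dicts so it can be tried on its own; `patch_font` and `remap_codepoints` are illustrative names, not part of any tool.

```python
def remap_codepoints(cmap, overrides):
    """Return a copy of a {codepoint: glyph_name} dict with overrides applied.

    overrides maps a glyph name to the list of code points it should own.
    """
    claimed = {cp for cps in overrides.values() for cp in cps}
    # Drop the old entries of overridden glyphs and any entry whose
    # code point is about to be claimed by an override.
    result = {
        cp: name
        for cp, name in cmap.items()
        if name not in overrides and cp not in claimed
    }
    for name, cps in overrides.items():
        for cp in cps:
            result[cp] = name
    return result


def patch_font(path_in, path_out, overrides):
    """Apply the overrides to every Unicode cmap subtable of a compiled font."""
    from fontTools.ttLib import TTFont  # third-party; imported lazily

    font = TTFont(path_in)
    for table in font["cmap"].tables:
        if table.isUnicode():
            table.cmap = remap_codepoints(table.cmap, overrides)
    font.save(path_out)


print(remap_codepoints({0x0041: "uni0041"}, {"uni0041": [0x0058]}))
# {88: 'uni0041'}
```

One caveat: format 4 subtables cannot hold code points above U+FFFF, so an override outside the BMP would also need a format 12 subtable, which this sketch does not create.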

frankrolf avatar Dec 21 '20 16:12 frankrolf

instead of swapping out glyphs via Unicode override, swap the names in the left column

Meaning, I should rename uniXXXX glyphs to something else? Or am I misunderstanding?

do the Unicode swapping in the generated font binary via fontTools

For various reasons I can’t do that in this case, but yes I agree that would work.

mbutterick avatar Dec 21 '20 18:12 mbutterick

If I understand correctly, you want A to be the default glyph in one project, while A.alt would be the default (read: triggered by code point U+0041) in another. You can create GlyphOrderAndAliasDB files on a per-project basis:

GlyphOrderAndAliasDB_1

A	A
B	B
C	C
A.alt	A.alt

GlyphOrderAndAliasDB_2

A.alt	A
B	B
C	C
A	A.alt

You can use makeotf’s -gf mode to specify one or another GlyphOrderAndAliasDB file.

frankrolf avatar Dec 21 '20 21:12 frankrolf

OK, I see what you mean. Yes, I am also doing substitutions like A for A.alt. But for those glyphs, I can override the codepoint in the third column, so there’s no need for a workaround. The codepoint override only fails for glyphs named uniXXXX.

For instance, suppose I have a test font with uni2032 and uni2032.alt and a GOADB like this:

uni2032        uni2032.alt
uni2032.alt    uni2032

This is the resulting cmap:

<?xml version="1.0" encoding="UTF-8"?>
<ttFont sfntVersion="OTTO" ttLibVersion="3.6">

  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x2032" name="uni2032"/><!-- PRIME -->
    </cmap_format_4>
    <cmap_format_6 platformID="1" platEncID="0" language="0">
    </cmap_format_6>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x2032" name="uni2032"/><!-- PRIME -->
    </cmap_format_4>
  </cmap>

</ttFont>

I’m not sure what I expected to see here, but this cmap is the same as the one above.

unicode-test2.zip

mbutterick avatar Dec 21 '20 22:12 mbutterick

I have a feeling that the cmap dumps are not even needed here. Since we are only moving within GlyphOrderAndAliasDB territory, this becomes a purely theoretical problem. We know that the code point given to a glyph is implied by the final name (left column), why not shuffle that name around? It seems odd to insist on a uniXXXX name to then override it.

Like this:

X	A	# converted to X
Y	B	# converted to Y
Z	C	# converted to Z

or this

uni0058	A	# converted to X
uni0059	B	# converted to Y
uni005A	C	# converted to Z

or this

uni0058	uni0041	# converted to X
uni0059	uni0042	# converted to Y
uni005A	uni0043	# converted to Z

frankrolf avatar Dec 21 '20 22:12 frankrolf

We know that the code point given to a glyph is implied by the final name (left column), why not shuffle that name around?

All my glyphs have a PUA codepoint (possibly in addition to one or more non-PUA codepoints). So yes, your suggestion would work, though it would preclude me from using those PUA codepoints (because the first-column name would completely determine the codepoint).

mbutterick avatar Dec 21 '20 23:12 mbutterick

instead of swapping out glyphs via Unicode override, swap the names in the left column

AFAICT the problem with this workaround is that OT feature code is tied to the existing glyph names, and this kind of glyph renaming would have unintended side effects.

mbutterick avatar Jan 16 '21 15:01 mbutterick

OT feature code can be written using either “friendly” names (middle column) or final names (left column). The ability to give human-readable names to glyphs in OT feature context is a big reason for the GlyphOrderAndAliasDB to exist.

Test project attached. feature test.zip

frankrolf avatar Jan 16 '21 15:01 frankrolf

Right. To my mind, if I’m making new names in the GOADB, and then I have to rename all the glyphs in the feature file, I might as well just rename the source glyphs in the first place to avoid this buggy uniXXXX name pattern.

mbutterick avatar Jan 16 '21 16:01 mbutterick

This (or related) behavior has come up again for me in an all-caps font, where Unicode overrides are used for non-AGD glyph names:

The following GlyphOrderAndAliasDB snippet results in only the uni0136 and uni013B code points (capital variants) – uni0137 and uni013C are missing from the OTF. Typing the lowercase characters will result in a .notdef.

uni0136	Kcommaaccent	uni0136,uni0137
uni013B	Lcommaaccent	uni013B,uni013C

This might border on an existential type question, but I would argue that the Unicode override (3rd column) should take precedence over whatever code point is implied by the first column. At the least, I would expect some kind of feedback – if this is indeed not allowed, and not just a bug in makeotf.
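To illustrate the kind of feedback I mean, here is a sketch of a check that could be run on a GlyphOrderAndAliasDB before calling makeotf. It flags rows whose final name is of the form uniXXXX or uXXXX[XX] while the third column asks for something else. Nothing like this exists in makeotf as far as I know; the function names and the handling of `#` comments are assumptions.

```python
import re

# uniXXXX (exactly 4 hex digits) or uXXXX[XX] (4 to 6 hex digits).
_UNI_NAME = re.compile(r"(?:uni([0-9A-F]{4})|u([0-9A-F]{4,6}))")


def implied_codepoint(name):
    """Return the code point encoded in a uniXXXX / uXXXX[XX] name, else None."""
    match = _UNI_NAME.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1) or match.group(2), 16)


def find_ignored_overrides(goadb_text):
    """List (line number, final name, override) for rows at risk of being ignored."""
    findings = []
    for line_no, line in enumerate(goadb_text.splitlines(), start=1):
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) < 3:
            continue  # no third column, nothing to override
        final_name, _friendly_name, override = fields[:3]
        implied = implied_codepoint(final_name)
        if implied is None:
            continue  # names like A or A.alt are not affected
        requested = [implied_codepoint(value) for value in override.split(",")]
        if requested != [implied]:
            findings.append((line_no, final_name, override))
    return findings


sample = "uni0136\tKcommaaccent\tuni0136,uni0137\nA\tA\tuni0058\n"
print(find_ignored_overrides(sample))
# [(1, 'uni0136', 'uni0136,uni0137')]
```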

I would like to discuss this further – any thoughts from the team?

frankrolf avatar Oct 15 '21 11:10 frankrolf

The behavior observed above basically breaks docs rule a), as outlined at the very top of this issue.

frankrolf avatar Oct 15 '21 11:10 frankrolf

I would like to discuss this further – any thoughts from the team?

Going back to rule a) referenced -- the current behavior does seem like a bug when weighed against that documentation. I don't have enough history on this to say whether the documentation has ever been correct (i.e. the tool used to work that way but has had some regression) or if the documentation was just wishful thinking. The documentation makes sense to me on general principle so maybe we should just treat it as being correct, and update the tool to do what it says.

josh-hadley avatar Oct 15 '21 17:10 josh-hadley

I have encountered this issue and need immediate help. The Kangxi radicals are all compatibility characters for their Unified Ideograph counterparts (e.g. U+2F00 is equivalent to U+4E00). The same goes for the CJK Compatibility Ideographs.

When making CJK fonts, it is usually good practice to map all the Kangxi radical codepoints to their Unified Ideographs, both for correct locale display and to save some precious GIDs (as Source Han does), but the behaviour in this issue blocks the GOADB from working correctly. (This issue did not affect the Source Han fonts because they are CID-keyed fonts, which use the cmap file/-ch option instead.) There is no workaround for this case: the codepoints must be merged onto one glyph.

Test case:

uni4E00	uni4E00	uni2F00,uni4E00
uni4E28	uni4E28	uni2F01,uni4E28
uni4E36	uni4E36	uni2F02,uni4E36
uni4E3F	uni4E3F	uni2F03,uni4E3F
uni4E59	uni4E59	uni2F04,uni4E59
uni4E85	uni4E85	uni2F05,uni4E85
uni4E8C	uni4E8C	uni2F06,uni4E8C
uni4EA0	uni4EA0	uni2F07,uni4EA0
uni4EBA	uni4EBA	uni2F08,uni4EBA
uni513F	uni513F	uni2F09,uni513F
uni5165	uni5165	uni2F0A,uni5165
uni516B	uni516B	uni2F0B,uni516B
uni5182	uni5182	uni2F0C,uni5182
uni5196	uni5196	uni2F0D,uni5196
uni51AB	uni51AB	uni2F0E,uni51AB
uni51E0	uni51E0	uni2F0F,uni51E0
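Rows like these do not have to be typed by hand. As a sketch, assuming the compatibility mappings in Python's `unicodedata` module are what you want, the following generates them for a range of Kangxi radicals; `radical_rows` is just an illustrative name.

```python
import unicodedata


def radical_rows(first=0x2F00, last=0x2F0F):
    """Build GOADB rows that give each unified ideograph its radical's code point too."""
    rows = []
    for radical in range(first, last + 1):
        # NFKC folds a Kangxi radical into its unified ideograph.
        unified = unicodedata.normalize("NFKC", chr(radical))
        if len(unified) != 1 or ord(unified) == radical:
            continue  # no single-character compatibility mapping
        name = f"uni{ord(unified):04X}"
        rows.append(f"{name}\t{name}\tuni{radical:04X},{name}")
    return rows


print(radical_rows()[0])  # uni4E00	uni4E00	uni2F00,uni4E00
```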

NightFurySL2001 avatar Feb 28 '23 15:02 NightFurySL2001

@NightFurySL2001 Yes, that's really bad. I checked and saw where it's assigning a single value when the glyph name is already a name like uni4E00. I fixed it in https://github.com/adobe-type-tools/afdko/tree/zqs-goadb-fix and tested it, and it's working for me. We'll check and add tests before getting this into the release, but if you want to try that branch it should be good for this.

punchcutter avatar Mar 01 '23 07:03 punchcutter

Thank you @punchcutter, the branch fix resolved the issue. I hope this gets pushed to release soon.

NightFurySL2001 avatar Mar 01 '23 11:03 NightFurySL2001