
Canonically equivalent names and Common User Name to File Name Algorithm

Open · moyogo opened this issue Dec 12 '18 · 4 comments

Glyph (or layer) names may be canonically equivalent, for example "é" <00E9> and "é" <0065, 0301>, or "Ω" <03A9> and "Ω" <2126>. On some filesystems (HFS+, APFS and others) these two would be normalized to the same string when used in filenames.

The Common User Name to File Name Algorithm should handle such a case and produce unique filenames on all filesystems.
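
For concreteness, a minimal Python sketch of the equivalence in question (standard library only; the filesystem behaviour itself is outside Python):

```python
import unicodedata

# Two canonically equivalent spellings of "é" and of the ohm/omega sign.
e_nfc = "\u00E9"          # é as a single precomposed code point
e_nfd = "\u0065\u0301"    # e + COMBINING ACUTE ACCENT
ohm   = "\u2126"          # OHM SIGN
omega = "\u03A9"          # GREEK CAPITAL LETTER OMEGA

# As Python strings they are distinct...
assert e_nfc != e_nfd
assert ohm != omega

# ...but they are canonically equivalent, so a filesystem that normalizes
# names (or compares them normalized) treats them as the same filename.
assert unicodedata.normalize("NFD", e_nfc) == unicodedata.normalize("NFD", e_nfd)
assert unicodedata.normalize("NFC", ohm) == unicodedata.normalize("NFC", omega)
```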

moyogo · Dec 12 '18 19:12

  • would these differences happen because a UFO is opened on a different platform?
  • or is there a use case (and environment) in which a user would be able to visually differentiate between "é" <00E9> and "é" <0065, 0301>?

LettError · Dec 12 '18 21:12

Good point.

would these differences happen because a UFO is opened on a different platform?

Yes. Creating a UFO with "é" <00E9> or "é" <0065, 0301> and "Ω" <03A9> or "Ω" <2126> in a glyph or layer name on HFS+ or APFS will write files or folders with "é" <0065, 0301> or "Ω" <03A9> while keeping the original characters in contents.plist. Opening such a UFO on a file system that doesn’t normalize will fail if the values in contents.plist don’t match the file and folder names.

or is there a use case (and environment) in which a user would be able to visually differentiate between "é" <00E9> and "é" <0065, 0301>?

No, canonically equivalent character strings should be treated and displayed as the same strings. There are some environments where this is not the case.

Technically, for Unicode compliance, [glyph and layer] names should always be normalized for search and comparison, so for example font[unicodedata.normalize("NFC", glyph_name)] == font[unicodedata.normalize("NFD", glyph_name)] would be True.
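
A minimal sketch of one way to make that invariant hold (the NormalizedGlyphMap container below is hypothetical, not part of any existing UFO library):

```python
import unicodedata

class NormalizedGlyphMap:
    """Hypothetical glyph container that normalizes names on every access,
    so canonically equivalent names refer to the same glyph."""

    def __init__(self):
        self._glyphs = {}

    @staticmethod
    def _key(name):
        # NFC is used here; NFD would work equally well as long as it is consistent.
        return unicodedata.normalize("NFC", name)

    def __setitem__(self, name, glyph):
        self._glyphs[self._key(name)] = glyph

    def __getitem__(self, name):
        return self._glyphs[self._key(name)]

    def __contains__(self, name):
        return self._key(name) in self._glyphs


font = NormalizedGlyphMap()
font["\u00E9"] = "eacute glyph"                # stored under the NFC spelling of "é"
assert font["\u0065\u0301"] == "eacute glyph"  # found via the NFD spelling too
```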

However this wouldn’t solve the HFS+/APFS to FAT32/NTFS/extfs issue, unless normalization is also done when matching glyph filenames to contents.plist values and layer folder names to layercontents.plist values.
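
A sketch of what that matching could look like when resolving a filename recorded in contents.plist against what is actually on disk (the helper is hypothetical; a real reader would obtain the filename via plistlib):

```python
import os
import unicodedata

def resolve_glif_path(glyphs_dir, filename_from_contents_plist):
    """Return the on-disk path for a filename recorded in contents.plist,
    even if the filesystem stored the name in a different normalization form."""
    direct = os.path.join(glyphs_dir, filename_from_contents_plist)
    if os.path.exists(direct):
        return direct
    # Fall back to a normalization-insensitive comparison against directory entries,
    # e.g. an NFD name written by HFS+ matched against an NFC value in the plist.
    wanted = unicodedata.normalize("NFC", filename_from_contents_plist)
    for entry in os.listdir(glyphs_dir):
        if unicodedata.normalize("NFC", entry) == wanted:
            return os.path.join(glyphs_dir, entry)
    raise FileNotFoundError(filename_from_contents_plist)
```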

So really this should be about normalizing names in searches and comparisons, not just in the Common User Name to File Name Algorithm.

moyogo · Dec 13 '18 22:12

I'm not terribly familiar with the details of all this. Could you describe how this could be resolved and I'll work it into the spec?

typesupply · Aug 12 '20 15:08

Basically Windows (NTFS, ReFS, exFAT and FAT+LFN) enforces an NFC normalization, whereas Mac (HFS+ and APFS) enforces an NFD normalization. Both are canonically equivalent, so whether the filesystem internally uses NFC or NFD should not matter. Linux/Unix filesystems and URLs do not enforce any normalization (so they may allow canonically equivalent aliases for resources that are in fact distinct).

The application (and formats referencing filenames) should always perform a canonical normalization internally (preferably NFC); then on macOS, filenames stored as NFD will be used as if they were NFC. Nowhere should we allow unnormalized names.
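
A minimal sketch of that rule, assuming every glyph or layer name is funneled through one helper before it is stored, compared, or turned into a file name (the helper names are made up for illustration):

```python
import unicodedata

def canonical_name(name):
    """Normalize a glyph or layer name once, at the API boundary, so that
    storage, comparison and file-name generation all see a single form (NFC)."""
    return unicodedata.normalize("NFC", name)

def same_name(a, b):
    """Normalization-insensitive comparison, e.g. for an NFD filename written
    by HFS+ against the NFC value kept in contents.plist."""
    return canonical_name(a) == canonical_name(b)
```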

This also means that allowed filenames should not accept any reserved/unassigned Unicode code points, even if filesystems may store and retrieve them, because their normalization is not guaranteed to be stable; it also means that fonts made for characters/scripts not yet formally encoded may only use PUA assignments, on which NO normalization at all occurs.
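
One possible check for that restriction, sketched with the standard unicodedata module (whether to allow PUA code points, category Co, is left to the caller, as discussed next):

```python
import unicodedata

def is_acceptable_in_filename(name):
    """Reject names containing unassigned code points (category Cn) or
    surrogates (Cs), whose normalization is not guaranteed to stay stable
    across Unicode versions; PUA code points (Co) are not rejected here."""
    return all(unicodedata.category(ch) not in ("Cn", "Cs") for ch in name)
```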

Such PUA usage in fonts (common while characters are still going through the encoding process) should carry some metadata tracking the vendor managing the private assignments and a tracing date, possibly via a reference URI (e.g. somewhere in the ConLang registry). That URI may point users to conversion data (at least some documentation or discussion, or a formal data file to be defined later if this is a common need, or specific data required by the vendor) allowing these private assignments to be reprocessed into the standard assignments that may come later, with their new normalization data in the UCD, possibly with disunification. If there has been a disunification or different rules apply, no remapping is possible without manual patching and review of the initial PUA-based font data; if there was a unification, there may be new rules for added "character variants", where what was initially a single PUA code point could become several code points selecting those variants, or could require additional characters such as joining controls.

But basically, for filename algorithms, all accepted names should be canonically normalized everywhere, independently of the filesystem used (I'm not advocating any "compatibility" normalization, so let's ignore NFKC and NFKD here, except where filesystems enforce some restrictions/remapping, which typically occurs only on very few characters: basically whitespace, dashes/hyphens, and possibly some East Asian "wide" and "narrow" variants). That leaves the question of legacy characters that were encoded in Unicode only for compatibility with previous standards: they may have their own mapping in fonts, not necessarily unified with the non-compatibility characters, which would only act as good "fallbacks" for renderers (possibly with some additional constraint for "synthetic" composition, e.g. on their basic intended metrics). But these compatibility characters may be perfectly acceptable as distinct filenames (and won't be affected by the enforcement of canonical normalizations).
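
A quick illustration of the distinction being drawn here: canonical normalization (NFC/NFD) leaves compatibility characters alone, and only the compatibility forms (NFKC/NFKD), ignored above, fold them away:

```python
import unicodedata

fi_ligature = "\uFB01"  # LATIN SMALL LIGATURE FI, a compatibility character

# Canonical normalization keeps it as a distinct character...
assert unicodedata.normalize("NFC", fi_ligature) == fi_ligature
assert unicodedata.normalize("NFD", fi_ligature) == fi_ligature

# ...whereas compatibility normalization folds it into plain "fi".
assert unicodedata.normalize("NFKC", fi_ligature) == "fi"
```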

Very few of these compatibility characters (which exist in Unicode only for round-trip compatibility with older non-UCS encodings), such as one Greek precomposed combining mark (which decomposes to other combining marks), need any additional metrics (they can probably be implicitly remapped in Unicode-based fonts), as this should not affect their rendering at all, except in a tricky rendering mode like "visible controls", where the rendered text should be able to distinguish every NFC/NFD-normalized combining sequence, possibly including separate glyphs for joiner controls that prevent some compositions in such a mode, but which probably won't make any rendering difference for such a legacy precomposed combining character.

There are not a lot of compatibility characters in the UCS, so we can easily define stable rules for all of them to determine their behavior, in several groups (an important group exists among CJK ideographs, along with the "narrow", "wide", "sub" and "sup" compatibility mappings in the UCD).

Normally this set of compatibility characters won't grow any time soon. The encoding rules are now stricter and more stable, and the stability rules have long been standardized and enforced in Unicode and ISO/IEC 10646, so NFC and NFD are also stable for all encoded characters and don't need any new compatibility mappings. (Rare exceptions were allowed and passed at Unicode and ISO/IEC for some added Latin IPA symbols, but the problems they caused are now solved. There was also an exception for the capital German sharp S "Eszett" when it was added without the standard case mapping, but this did not create any new compatibility mapping, and the same will likely hold if the UCS adds uppercase variants for some Latin lowercase letters created as IPA symbols but borrowed by African languages that are adding their own capitals: the IPA symbol won't be changed; instead, special case mappings will be added outside the base UCD, which will normally not affect rendering in fonts, except with some text-transform effects that text renderers infer themselves.)

Other possible special behaviors concern compatibility with mathematical symbols (notably those derived from digits and some letters of common alphabets, plus some operators, which enforce particular metrics, font styles, or specific positioning and shape variants): this set is more complex and tends to grow over time (it may also affect chemical symbols needing their variants encoded distinctly, maybe but not necessarily with compatibility mappings).

Another complex subset is starting to appear with emoji symbol variants (but they are all encoded without any compatibility mappings, so this does not affect any canonical normalization): this set is complex because of their composition layout.

The same would be true of future efforts to better support hieroglyphic scripts with very complex composition rules, which are not supported natively by normal text renderers without some other rich-text or graphical document format (this is also true of stenographic and Visible Speech scripts, possibly of other notational scripts used in arts, music and dance, and possibly later of maths, chemistry, advanced research and engineering), or to add support for variable directions in historic texts, like boustrophedon; for now most of the effort and the issues are in East Asian scripts and their vertical layout. We are not ready to see any improvement to the BiDi algorithm that would work with variable directions (it's already complex to manage BiDi not just in plain text but inside documents and in application UI design, and support for vertical layouts is still very poor/minimalist). If such things start appearing later, I don't think they will ever use any "canonical mappings"; they will use their own normalization schemes, not depending on NFC/NFD (creating renderers and fonts supporting these features will be a challenge, but for now none of these efforts require any change in font specs, as they use or develop their own document formats, far from reaching any normalization step).

verdy-p · Dec 09 '22 20:12