webauthn
Unicode "tag" characters are deprecated for language tagging
6.4.2. Language and Direction Encoding https://www.w3.org/TR/webauthn-2/#sctn-strings-langdir
The first encodes a language tag with the code point U+E0001 followed by the ASCII values of the language tag each shifted up by U+E0000. For example, the language tag “en-US” becomes the code points U+E0001, U+E0065, U+E006E, U+E002D, U+E0055, U+E0053.
The use of Unicode language tag characters for language identification is strongly deprecated by Unicode. Introducing these language tag characters on the wire is probably not desirable. Other standards have generally introduced specific encoding mechanisms, such as JSON-LD's I18N namespace, to allow language tags and direction metadata to be encoded using ASCII characters, and this is preferable. This is especially the case for length-constrained fields, since the language tag characters require four bytes per code point in any of the Unicode encodings.
The I18N WG is in the process of revising its String-Meta document to clarify best practices in this area.
[Cross-posted from i18n-activity issue]:
Is there any danger of this mechanism being truly broken? While the consortium has all sorts of negative things to say about the scheme, it offers no meaningful direct alternative. There is no direct alternative for in-band language tagging in plain text, and the scheme doesn't conflict with other non-deprecated uses in applications that are aware of it. The deprecation of the Unicode in-band language tagging scheme reads more like a 'considered harmful' than a recognizable set of real problems.
TL;DR: I think the reasons for deprecating this are overblown, and don't really have any impact on the particular use case; the scheme is easy to implement and hard to mess up, and isn't going to become broken; so in the limited use case, it is not an issue.
Related: https://w3c.github.io/string-meta/#unicodeTags
I drew an action item to reply directly to @xorgy's comment.
The short answer is: the mechanism isn't "truly broken" from the point of view that one could use it to encode language tags into strings. But Unicode deprecated this use and the technical reasons behind not using this mechanism are pretty long, including:
- the encoding is particularly inefficient (the tag characters and introducer/cancel tags each require 4 bytes in any of the Unicode encodings (UTF-32, UTF-16, UTF-8))
- the mechanism alters string data and requires the introspection and alteration of string data to/from the application; that is, in order to send an arbitrary string value down the wire, I have to use a special function to attach the language tag and another to remove it on the receiving end. Further, the tag characters have to be converted to their ASCII equivalents before being used in e.g. the HTML `lang` attribute.
- not all rendering systems ignore the tags (should one choose not to remove them) and they may show as "tofu" (hollow boxes), interfering with usability
- rendering systems don't understand the tags, so they don't help with font selection, shaping, and other rendering/processing unless they are processed and passed as markup or to `setLanguage`/`setLocale` APIs.
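The inefficiency point in the first bullet is easy to verify: tag characters live in plane 14, so each one costs four bytes in UTF-8, UTF-16 (a surrogate pair), and UTF-32, versus one byte per character for a plain ASCII tag in UTF-8. A quick check:

```python
# Compare the encoded size of a plain ASCII language tag against its
# tag-character form (introducer U+E0001 plus shifted code points).
ascii_tag = "en-US"
tagged = "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in ascii_tag)

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(ascii_tag.encode(enc)), len(tagged.encode(enc)))
# utf-8:     5 vs 24 bytes
# utf-16-le: 10 vs 24 bytes
# utf-32-le: 20 vs 24 bytes
```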
There are some advantages to these tags. Notably, they can be used "in-band" to tag substrings: most ASCII-based syntaxes, such as those found in JSON-LD, only permit the whole of the string to be tagged with a language. However, the encoding efficiency problem remains, and someone receiving a multi-language string would have to do a fair amount of work to make use of that tagging.
It's much easier (and likely more interoperable) if WebAuthn provides a separate optional field or uses one of the ASCII schemes instead. If WebAuthn insisted on using this mechanism, I18N would not object. But we don't think it makes sense.
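For illustration, a minimal sketch of what the ASCII-scheme alternative looks like. The `@value`/`@language`/`@direction` key names follow JSON-LD 1.1 value objects; the payload shape itself is hypothetical, since WebAuthn would define its own field names if it adopted this approach:

```python
import json

# Hypothetical payload: the display string travels untouched, with the
# BCP 47 language tag and base direction carried in separate ASCII keys
# (JSON-LD 1.1 style), rather than embedded as tag characters.
display_name = {
    "@value": "مثال",       # the string itself, unmodified
    "@language": "ar",      # BCP 47 language tag, plain ASCII
    "@direction": "rtl",    # base direction metadata
}

wire = json.dumps(display_name, ensure_ascii=False)
print(wire)
```

No stripping or introspection is needed on the receiving end; the consumer reads the metadata keys and passes them straight to e.g. the HTML `lang` and `dir` attributes.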