
[FEEDBACK] Message Format Unquoted Literals

Open macchiati opened this issue 1 year ago • 1 comments

Summary

Consider relaxing constraints on literals, after v45

Background

Right now, unquoted literals are fairly narrowly constrained by message.abnf; here are the relevant lines:

unquoted = name / number-literal

; number-literal matches JSON number (https://www.rfc-editor.org/rfc/rfc8259#section-6)
number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT]
                 [%i"e" ["-" / "+"] 1*DIGIT]

; name matches https://www.w3.org/TR/REC-xml-names/#NT-NCName
name = name-start *name-char
name-start = ALPHA / "_"
           / %xC0-D6 / %xD8-F6 / %xF8-2FF
           / %x370-37D / %x37F-1FFF / %x200C-200D
           / %x2070-218F / %x2C00-2FEF / %x3001-D7FF
           / %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "."
          / %xB7 / %x300-36F / %x203F-2040
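
To make the current constraint concrete, here is a minimal Python sketch of a matcher for `unquoted = name / number-literal`. The character classes are transcribed from the ABNF lines above; the names `NAME_RE`, `NUMBER_RE`, and `is_unquoted` are made up for this illustration and are not part of any MessageFormat implementation.

```python
import re

# Character classes transcribed from the name-start / name-char rules above.
NAME_START = (
    r"A-Za-z_"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFC\U00010000-\U000EFFFF"
)
NAME_CHAR = NAME_START + r"0-9\-.\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")
# number-literal: optional "-", no leading zeros, optional fraction and exponent.
NUMBER_RE = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?")


def is_unquoted(text: str) -> bool:
    """Return True if text matches the current unquoted production."""
    return bool(NAME_RE.fullmatch(text) or NUMBER_RE.fullmatch(text))
```

With this sketch, `is_unquoted("count")` and `is_unquoted("1.5e3")` hold, while an interval such as `[0,1)` is rejected and would need quoting.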

Reason for reconsidering

However, for functions outside of the standard registry, this forces many natural literals to use quotes. Here is an example from a function that would handle MF1’s choice format:

[0,1) {{{$count} is zero or fraction}}

The natural literals to use would be intervals, which use the characters [, (, ), and ] for ranges. (The choice format would require some recasting because it depends on the ordering of variants; it currently uses >.) Under the current ABNF, that key would have to be quoted:

|[0,1)| {{{$count} is zero or fraction}}
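
For illustration, this is roughly what a tool has to do today whenever a key falls outside the unquoted syntax. The sketch assumes the quoted-literal escaping rule (a backslash before `\` and `|`), which is my reading of message.abnf and is not quoted in this issue; `quote_literal` is a hypothetical helper name.

```python
def quote_literal(text: str) -> str:
    """Wrap text in |...| so it can be used where an unquoted literal is not allowed."""
    # Escape the backslash first so the backslashes added for "|" are not doubled.
    escaped = text.replace("\\", "\\\\").replace("|", "\\|")
    return f"|{escaped}|"
```

So `quote_literal("[0,1)")` gives `|[0,1)|`, the form shown above.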

Many Unicode symbols are included by XML’s NT-NCName (about 6,000 currently), while many are excluded (about 2,600 currently). But these are literals, not identifiers, which is what name is intended for. By expanding beyond identifier usage, it allows functions to avoid requiring quoting in many cases. It also allows us to dispense with the special formulation for number-literal.

The literals for number, date, etc could be specified elsewhere, but wouldn’t have to be in the ABNF.

That would allow various registries to have more sophisticated literals without requiring quoting, and without privileging the structured literals that we know about now.

Requirements

So, what restrictions on characters for a broadened definition of unquoted literals would be required by a revised ABNF?

  1. No ‘}’, because it would make .local $x = {literal} fail.

  2. No ‘|’, because an initial one would conflict with quoting, and it is best to just forbid it anywhere in an unquoted literal to prevent confusion.

  3. No ‘{’. Not strictly required, but for clarity wherever used.

  4. None of the big blocks of ‘strange’ code points that XML forbids: controls, surrogates, private-use, noncharacters.

    1. These are all immutable (Unicode Character Encoding Stability).

    2. This also disallows the noncharacters that XML didn’t know about yet, before the noncharacter property was made immutable.

  5. No whitespace, since variant uses that for separators between keys.

    1. This could be done by just disallowing the “s” production characters, but that could be very confusing. {a b} looks too much like two items (the space is an A0 NO-BREAK SPACE). So it should be broadened to the Unicode Whitespace characters.

    2. Unicode Whitespace is not guaranteed immutable, but has not changed for over a decade. Anyway, we would derive the code points as of now, so everything would be stable into the future.

  6. (Any others?)

Not coincidentally, 1-3 are the characters in the reserved-escape production.

Detailed Proposal

This would result in the following change:

OLD

unquoted = name / number-literal

; number-literal matches JSON number (https://www.rfc-editor.org/rfc/rfc8259#section-6)
number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT]
                 [%i"e" ["-" / "+"] 1*DIGIT]

// The characters include the following (though name-char and number-literal additions are positional):

// name-start is [\: A-Z _ a-z \x{C0}-\x{D6} \x{D8}-\x{F6} \x{F8}-\x{2FF} \x{370}-\x{37D} \x{37F}-\x{1FFF} \x{200C}-\x{200D} \x{2070}-\x{218F} \x{2C00}-\x{2FEF} \x{3001}-\x{D7FF} \x{F900}-\x{FDCF} \x{FDF0}-\x{FFFD} \x{10000}-\x{EFFFF}]

// name-char adds [\- . 0-9 \x{B7} \x{0300}-\x{036F} \x{203F}-\x{2040}]

// number-literal adds [+ e]

NEW

Unquoted = literal-char+

// Then down in ; Restrictions on characters in various contexts

literal-char = _all but following list; simpler to leave in this format until after feedback._

Needed to avoid syntax conflicts

U+007B LEFT CURLY BRACKET
U+007C VERTICAL LINE
U+007D RIGHT CURLY BRACKET

Whitespace

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+2000 - U+200A EN QUAD .. HAIR SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

Controls

U+0000 - U+001F
U+007F - U+009F

Surrogates

U+D800 - U+DFFF

Private Use

U+E000 - U+F8FF
U+F0000 - U+FFFFD
U+100000 - U+10FFFD

Noncharacters

U+FDD0 - U+FDEF
U+FFFE U+FFFF U+1FFFE U+1FFFF U+2FFFE U+2FFFF U+3FFFE U+3FFFF
U+4FFFE U+4FFFF U+5FFFE U+5FFFF U+6FFFE U+6FFFF U+7FFFE U+7FFFF U+8FFFE
U+8FFFF U+9FFFE U+9FFFF U+AFFFE U+AFFFF U+BFFFE U+BFFFF U+CFFFE U+CFFFF
U+DFFFE U+DFFFF U+EFFFE U+EFFFF U+FFFFE U+FFFFF U+10FFFE
U+10FFFF
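
As a rough sketch of what `literal-char` would accept under this proposal, the exclusion list can be turned into a predicate. The code below assumes the noncharacter block is U+FDD0..U+FDEF plus the last two code points of every plane (as in the range table later in this thread); `is_literal_char` and `is_broadened_unquoted` are hypothetical names.

```python
# Code points excluded to avoid syntax conflicts: { | }
SYNTAX = {0x7B, 0x7C, 0x7D}

# Whitespace code points listed in the proposal.
WHITESPACE = {0x20, 0xA0, 0x1680, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
              *range(0x2000, 0x200B)}

EXCLUDED_RANGES = [
    (0x0000, 0x001F), (0x007F, 0x009F),                          # controls
    (0xD800, 0xDFFF),                                            # surrogates
    (0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),  # private use
    (0xFDD0, 0xFDEF),                                            # noncharacter block
]


def is_literal_char(ch: str) -> bool:
    cp = ord(ch)
    if cp in SYNTAX or cp in WHITESPACE:
        return False
    if (cp & 0xFFFE) == 0xFFFE:  # U+xFFFE / U+xFFFF noncharacters in every plane
        return False
    return not any(lo <= cp <= hi for lo, hi in EXCLUDED_RANGES)


def is_broadened_unquoted(text: str) -> bool:
    """Unquoted = literal-char+ (one or more allowed characters)."""
    return len(text) > 0 and all(is_literal_char(ch) for ch in text)
```

Under this sketch `[0,1)` is accepted without quotes, while `a b` with a NO-BREAK SPACE in the middle is rejected.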

macchiati avatar Mar 13 '24 04:03 macchiati

We should consider this in a severely timeboxed way. Bear in mind design, which is not directly "on the nose" to this request.

Note that unquoted literals appear in other places than in keys. We previously reserved a bunch of the ASCII punctuation (which is the main consideration here) for future use via reserved-statement. Removing that from the syntax does not mean that we should pilfer the box for more of these characters. Things that spoof sigils in appearance are probably a Bad Idea.

For example, one of the characters not listed above is :, which is the function introducer and namespace separator. It can't be in an unquoted. # and / probably need to be avoided because of markup. And @ because of attributes.

On the other hand, square brackets and parens seem potentially useful, as does some of the other junk.

aphillips avatar Sep 11 '24 00:09 aphillips

This won't go in 46.1, so I'm going to change the labels. I am also adding resolve-candidate because I think we won't extend unquoted, but that's for the WG to decide.

aphillips avatar Nov 09 '24 00:11 aphillips

One comment; broadening can be done in future versions, since it would be backwards compatible.

macchiati avatar Nov 09 '24 01:11 macchiati

Broadening can be done, so long as it is done in a backwards-compatible way. It's a little tricky here, because of the uses of literals in the syntax. I haven't carefully reviewed the proposal recently enough to say one way or the other if there are sticky bits that I'd object to. I won't say that we would never do an extension (never is a long time), but I think it unlikely in the 2.0 timeframe (e.g. 46.1/47).

aphillips avatar Nov 09 '24 01:11 aphillips

I would like to have a discussion on this in January.

macchiati avatar Dec 09 '24 19:12 macchiati

I will add this to a meeting agenda in January. The WG asked me to close this last week, as the consensus was that we would not revisit the syntax in the 46.1 timeframe and didn't feel that our previous discussions about text literals would change later.

Our usual process would be to create a design doc (the onus is on the person proposing a change to document the proposal and compare/contrast it with the current design and other options), but let's discuss first.

aphillips avatar Dec 16 '24 17:12 aphillips

I would like at least a clear statement in v47 that the literal syntax could be expanded in future versions to include more characters than just what are in name and the simple number-literal.

macchiati avatar Jan 08 '25 22:01 macchiati

The literal syntax allows nearly any Unicode character. Do you mean unquoted-literal?

aphillips avatar Jan 08 '25 23:01 aphillips

Yes, I meant unquoted = name / number-literal can be expanded.

macchiati avatar Jan 08 '25 23:01 macchiati

I just cleaned up the description a bit.

  1. Added restriction 3
  2. Changed 'unquoted' to 'unquoted-literal' (I think the ABNF might have changed after I wrote the first version)
  3. Added Notes at the end.

@eemeli , any comments?

macchiati avatar Jan 26 '25 15:01 macchiati

We also need to reserve at least @ (attributes) and # & / (markup), and almost certainly = as it's used in options and attributes.

I would be much more comfortable with !, %, *, and ? also being reserved to allow for some future extensibility. ', ", (, ), [, and ] should be reserved to preclude misunderstandings.

eemeli avatar Jan 26 '25 19:01 eemeli

Strictly speaking,

Reserving '@' is not necessary, because syntactically it can't cause ambiguity in the syntax (an attribute can't be in the same position as a literal). Same for '#' and '='.

The '=' because it is required that a literal be separated from an attribute or option identifier by a =. So with someOption=a=b you know that the option value starts after the first =, but it doesn't formally matter that it contains a =. However, it could be confusing to readers.

'/' on the other hand can cause ambiguity in the syntax, because of the following: markup = "{" o "#" identifier *(s option) *(s attribute) o ["/"] "}". The option could have a value that is unquoted, and if / is allowed as a final character in that value then it would collide. It could be allowed as a non-final character. That's unfortunate, because it is the simplest character for a rational number, but it can't be helped.

Disallowing (, ), [, and ] would be unfortunate, since they are the natural characters for open/closed ranges.

I don't see any math use for ' or ", so no particular reason to allow them in unquoted literals.

However, I admit it would be simpler if the only ASCII characters allowed in unquoted literals were [A-Za-z0-9-+_.].
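
A toy illustration of the '/' collision, as a sketch only: it looks at the text between the last `=` and the closing `}` of a markup placeholder and lists the possible readings. It ignores optional whitespace and anything else the real grammar allows there; `option_value_parses` is a hypothetical name.

```python
def option_value_parses(tail: str, slash_in_unquoted: bool) -> list[tuple[str, bool]]:
    """Enumerate (value, self_closing) readings of the text between "=" and "}"."""
    readings = []
    # Reading 1: the whole tail is the option value; the markup is not self-closing.
    if tail and ("/" not in tail or slash_in_unquoted):
        readings.append((tail, False))
    # Reading 2: a final "/" is the self-closing marker, not part of the value.
    if tail.endswith("/"):
        value = tail[:-1]
        if value and ("/" not in value or slash_in_unquoted):
            readings.append((value, True))
    return readings
```

For `{#img src=a/}` the tail is `a/`: with '/' allowed anywhere in unquoted literals there are two readings, and with '/' disallowed there is one. A non-final '/' as in `1/2` has a single reading when allowed.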

macchiati avatar Jan 27 '25 04:01 macchiati

> Reserving '@' is not necessary, because syntactically it can't cause ambiguity in the syntax (an attribute can't be in the same position as a literal). Same for '#' and '='.

If # is not reserved, {#foo} could parse as either a literal-expression or as markup.

If @ is not reserved, {@foo} would look like an expression with only an attribute, and that would be confusing.

> The '=' because it is required that a literal be separated from an attribute or option identifier by a =. So with someOption=a=b you know that the option value starts after the first =, but it doesn't formally matter that it contains a =. However, it could be confusing to readers.

Yes, and we need to keep in mind that we should not presume that people look at MF2 often. Not confusing readers is a high priority.

eemeli avatar Jan 27 '25 06:01 eemeli

We'll also need to exclude * as unquoted-literal is also used in variant keys, where * is special.

eemeli avatar Jan 27 '25 17:01 eemeli

proposal:

  1. In ASCII, only [A-Za-z0-9-+_.]
  2. Disallow all of the big blocks of ‘strange’ code points that XML forbids: controls, (unpaired) surrogates, private-use, noncharacters.

    1. These are all immutable (Unicode Character Encoding Stability).

    2. This also disallows the noncharacters that XML didn’t know about yet, before the noncharacter property was made immutable.

  3. No whitespace, since variant uses that for separators between keys, and expressions use it to separate various components.

    1. This could be done by just disallowing the “s” production characters, but that could be very confusing. {a b} looks too much like two items (the space is an A0 NO-BREAK SPACE). So it should be broadened to the Unicode Whitespace characters.

    2. Unicode Whitespace is not guaranteed immutable, but has not changed for over a decade. Anyway, we would derive the code points as of now, so everything would be stable into the future.

macchiati avatar Jan 27 '25 18:01 macchiati

I just made a census of characters inside and outside of name_char:

Property in/outside of name_char Count
\p{Letter} in name_char: 141,025
\p{Letter} outside name_char: 3
\p{Mark} in name_char: 2,501
\p{Mark} outside name_char: 0
\p{Number} in name_char: 1,793
\p{Number} outside name_char: 118
\p{Symbol} in name_char: 6,016
\p{Symbol} outside name_char: 2,498
\p{Punctuation} in name_char: 701
\p{Punctuation} outside name_char: 154
\p{Separator} in name_char: 1
\p{Separator} outside name_char: 18
\p{Whitespace} in name_char: 1
\p{Whitespace} outside name_char: 24
\p{Cc} in name_char: 0
\p{Cc} outside name_char: 65
\p{Cf} in name_char: 145
\p{Cf} outside name_char: 25
\p{Cs} in name_char: 0
\p{Cs} outside name_char: 2,048
\p{Co} in name_char: 0
\p{Co} outside name_char: 137,468
\p{Noncharactercodepoint} in name_char: 28
\p{Noncharactercodepoint} outside name_char: 38
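
For anyone who wants to reproduce a census like this, here is one possible sketch using Python's `unicodedata`. It is not the tool used for the numbers above. The `name_char` ranges are transcribed from the ABNF at the top of the issue, and `census` and `tally` are hypothetical helper names. Counts for Letter, Number, Symbol, and so on depend on the Unicode version bundled with the Python runtime, so only the immutable categories (Cc, Cs, Co, noncharacters) are expected to match exactly.

```python
import sys
import unicodedata
from collections import Counter

# name-char ranges transcribed from the ABNF quoted at the top of this issue.
NAME_CHAR_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xB7, 0xB7), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x300, 0x36F),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x203F, 0x2040),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFC), (0x10000, 0xEFFFF),
]


def census() -> Counter:
    """Count code points per (general category, in/outside name_char)."""
    flags = bytearray(sys.maxunicode + 1)  # 1 where the code point is in name_char
    for lo, hi in NAME_CHAR_RANGES:
        flags[lo:hi + 1] = b"\x01" * (hi - lo + 1)
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        side = "in" if flags[cp] else "outside"
        counts[(unicodedata.category(chr(cp)), side)] += 1
        if (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
            counts[("noncharacter", side)] += 1
    return counts


def tally(counts: Counter, prefix: str, side: str) -> int:
    """Sum counts for categories starting with prefix, e.g. "L", "Cc", "noncharacter"."""
    return sum(n for (cat, s), n in counts.items() if s == side and cat.startswith(prefix))
```

For example, `tally(census(), "L", "in")` gives the Letter count inside name_char for whatever Unicode version the interpreter ships with.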

macchiati avatar Jan 28 '25 02:01 macchiati

So if we include just the ASCII that unquoted-literal contains now (without bothering with the number syntax), and regularize it, we'd get:

[[A-Za-z0-9\-_.+-][^\p{Cc}\p{Cs}\p{Co}\p{NChar}\p{whitespace}\p{ascii}]]

That is,

  • Only specific ASCII: A-Za-z0-9 - _ . + - (plus any others we want to 'release')
  • All non-ASCII except:

| Type | Ranges |
| --- | --- |
| Control characters (Cc) | %x0-%x1F %x7F-%x9F |
| Surrogates (Cs) | %xD800-%xDFFF |
| Private-Use (Co) | %xE000-%xF8FF %xF0000-%xFFFFD %x100000-%x10FFFD |
| Whitespace | %x00A0 %x1680 %x2000-%x200A %x2028 %x2029 %x202F %x205F %x3000 |
| Non-characters (NChar) | %xFDD0-%xFDEF %xFFFE %xFFFF %x1FFFE %x1FFFF %x2FFFE %x2FFFF %x3FFFE %x3FFFF %x4FFFE %x4FFFF %x5FFFE %x5FFFF %x6FFFE %x6FFFF %x7FFFE %x7FFFF %x8FFFE %x8FFFF %x9FFFE %x9FFFF %xAFFFE %xAFFFF %xBFFFE %xBFFFF %xCFFFE %xCFFFF %xDFFFE %xDFFFF %xEFFFE %xEFFFF %xFFFFE %xFFFFF %x10FFFE %x10FFFF |

That would result in the following.

unquoted-char =
      %x2B          ; "+"      omit Cc %x0-1F, Whitespace " ", Ascii "!".."*"
    / %x2D-2E       ; "-".."." omit Ascii ","
    / %x30-39       ; "0".."9" omit Ascii "/"
    / %x41-5A       ; "A".."Z" omit Ascii ":".."@"
    / %x5F          ; "_"      omit Ascii "[".."^"
    / %x61-7A       ; "a".."z" omit Ascii "`"
    / %xA1-167F     ;          omit Cc %x7F-9F, Whitespace %xA0, Ascii "{".."~"
    / %x1681-1FFF   ;          omit Whitespace %x1680
    / %x200B-2027   ;          omit Whitespace %x2000-200A
    / %x202A-202E   ;          omit Whitespace %x2028-2029
    / %x2030-205E   ;          omit Whitespace %x202F
    / %x2060-2FFF   ;          omit Whitespace %x205F
    / %x3001-D7FF   ;          omit Whitespace %x3000
    / %xF900-FDCF   ;          omit Cs %xD800-DFFF, Co %xE000-F8FF
    / %xFDF0-FFFD   ;          omit NChar %xFDD0-FDEF
    / %x10000-1FFFD ;          omit NChar %xFFFE-FFFF
    / %x20000-2FFFD ;          omit NChar %x1FFFE-1FFFF
    / %x30000-3FFFD ;          omit NChar %x2FFFE-2FFFF
    / %x40000-4FFFD ;          omit NChar %x3FFFE-3FFFF
    / %x50000-5FFFD ;          omit NChar %x4FFFE-4FFFF
    / %x60000-6FFFD ;          omit NChar %x5FFFE-5FFFF
    / %x70000-7FFFD ;          omit NChar %x6FFFE-6FFFF
    / %x80000-8FFFD ;          omit NChar %x7FFFE-7FFFF
    / %x90000-9FFFD ;          omit NChar %x8FFFE-8FFFF
    / %xA0000-AFFFD ;          omit NChar %x9FFFE-9FFFF
    / %xB0000-BFFFD ;          omit NChar %xAFFFE-AFFFF
    / %xC0000-CFFFD ;          omit NChar %xBFFFE-BFFFF
    / %xD0000-DFFFD ;          omit NChar %xCFFFE-CFFFF
    / %xE0000-EFFFD ;          omit NChar %xDFFFE-DFFFF
                    ;          omit Co %xF0000-FFFFD %x100000-10FFFD, NChar %xEFFFE-EFFFF %xFFFFE-FFFFF %x10FFFE-10FFFF
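
One way to sanity-check a hand-written range list like this is to derive the ranges from the rule itself and compare. The sketch below does that under the stated rule (specific ASCII only; non-ASCII minus Cc, Cs, Co, noncharacters, and the listed whitespace). The helper names `allowed` and `to_ranges` are made up for this illustration.

```python
import string
import unicodedata

ASCII_ALLOWED = set(map(ord, string.ascii_letters + string.digits + "-+_."))
# Non-ASCII whitespace code points from the table above.
WHITESPACE = {0xA0, 0x1680, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
              *range(0x2000, 0x200B)}


def allowed(cp: int) -> bool:
    if cp < 0x80:
        return cp in ASCII_ALLOWED
    if cp in WHITESPACE or (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
        return False
    return unicodedata.category(chr(cp)) not in ("Cc", "Cs", "Co")


def to_ranges(pred, limit: int = 0x110000) -> list[tuple[int, int]]:
    """Collapse the code points accepted by pred into inclusive (lo, hi) ranges."""
    ranges, start = [], None
    for cp in range(limit + 1):  # one extra step flushes a range still open at the end
        ok = cp < limit and pred(cp)
        if ok and start is None:
            start = cp
        elif not ok and start is not None:
            ranges.append((start, cp - 1))
            start = None
    return ranges


# The unquoted-char alternatives above, as (lo, hi) pairs.
ABNF_RANGES = [
    (0x2B, 0x2B), (0x2D, 0x2E), (0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F),
    (0x61, 0x7A), (0xA1, 0x167F), (0x1681, 0x1FFF), (0x200B, 0x2027),
    (0x202A, 0x202E), (0x2030, 0x205E), (0x2060, 0x2FFF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
] + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 15)]
```

Comparing `to_ranges(allowed)` with `ABNF_RANGES` checks that the ABNF and the rule agree. Because the excluded categories are all immutable, the comparison should not depend on the Unicode version of the runtime.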

macchiati avatar Jan 29 '25 00:01 macchiati

I don't agree that this closes the issue.

I was waiting on this PR before pursuing https://github.com/unicode-org/message-format-wg/issues/724#issuecomment-2620310458; that is, rationalizing the structure to exclude whitespace, and not exclude arbitrary percentages of punctuation, symbol, number, etc.

macchiati avatar Feb 12 '25 22:02 macchiati

I think the PR auto-closed this. Reopening.

aphillips avatar Feb 12 '25 22:02 aphillips

Thanks, here is an update, recalculating after #990:

name_char

ABNF:
ALPHA / "_"
/ %xC0-D6 / %xD8-F6 / %xF8-2FF
/ %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D
/ %x2070-218F / %x2C00-2FEF / %x3001-D7FF
/ %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF

UnicodeSet:
[[A-Za-z] {_}
\x{C0}-\x{D6} \x{D8}-\x{F6} \x{F8}-\x{2FF}
\x{370}-\x{37D} \x{37F}-\x{61B} \x{61D}-\x{1FFF} \x{200C}-\x{200D}
\x{2070}-\x{218F} \x{2C00}-\x{2FEF} \x{3001}-\x{D7FF}
\x{F900}-\x{FDCF} \x{FDF0}-\x{FFFC} \x{10000}-\x{EFFFF}]

Here is the comparison to relevant properties, updated from above.

name_char 971,630
\p{Letter} in name_char: 141,025
\p{Letter} outside name_char: 3
\p{Number} in name_char: 1,793
\p{Number} outside name_char: 118
\p{Symbol} in name_char: 6,016
\p{Symbol} outside name_char: 2,498
\p{Punctuation} in name_char: 701
\p{Punctuation} outside name_char: 154
\p{Separator} in name_char: 1
\p{Separator} outside name_char: 18
\p{Whitespace} in name_char: 1
\p{Whitespace} outside name_char: 24
\p{Cf} in name_char: 145
\p{Cf} outside name_char: 25
\p{Noncharactercodepoint} in name_char: 28
\p{Noncharactercodepoint} outside name_char: 38

name_nonstart

This is (name_char - name_start). I am showing the name_nonstart characters because that is more informative. They are just the characters in name-char = name-start / …

ABNF:
DIGIT / "-" / "."
/ %xB7 / %x300-36F / %x203F-2040

UnicodeSet:
[[0-9] {-} {.}
\x{B7} \x{300}-\x{36F} \x{203F}-\x{2040}]

Here is the comparison to relevant properties. I didn't include them before, but they are worth looking at.

name_nonstart 127
\p{Mark} in name_nonstart: 112
\p{Mark} outside name_nonstart: 2,389
\p{Number} in name_nonstart: 10
\p{Number} outside name_nonstart: 1,901
\p{Punctuation} in name_nonstart: 5
\p{Punctuation} outside name_nonstart: 850

DIGIT / "-" / "." make sense to restrict as initial characters. But the others don't:

  • If the goal is to not have names start with combining marks, the current status is a miserable failure (it only excludes 4.7% of the characters). And why bother with excluding them? We should just document that although the syntax allows them, they are not recommended.
  • Why exclude · U+00B7 MIDDLE DOT, when there are 12 very similar characters?
  • Why exclude ‿ U+203F UNDERTIE and ⁀ U+2040 CHARACTER TIE when there are 8 other connector punctuation characters?

I'll recalculate what name_char and name_start would be if taking the above into account.

macchiati avatar Feb 13 '25 00:02 macchiati

Here's a first cut at a revision:

name-start = ALPHA
    / %x2B          ; 【+】      omit Cc %x0-1F, Whitespace 【 】, Ascii 【!"#$%&'()*】
    / %x5F          ; 【_】      omit Ascii 【,-./0123456789:;<=>?@】 【[\]^】
    / %xA1-167F     ;          omit Cc %x7F-9F, Whitespace %xA0, Ascii 【`】 【{|}~】
    / %x1681-1FFF   ;          omit Whitespace %x1680
    / %x200B-2027   ;          omit Whitespace %x2000-200A
    / %x202A-202E   ;          omit Whitespace %x2028-2029
    / %x2030-205E   ;          omit Whitespace %x202F
    / %x2060-2FFF   ;          omit Whitespace %x205F
    / %x3001-D7FF   ;          omit Whitespace %x3000
    / %xF900-FDCF   ;          omit Cs %xD800-DFFF, Co %xE000-F8FF
    / %xFDF0-FFFD   ;          omit NChar %xFDD0-FDEF
    / %x10000-1FFFD ;          omit NChar %xFFFE-FFFF
    / %x20000-2FFFD ;          omit NChar %x1FFFE-1FFFF
    / %x30000-3FFFD ;          omit NChar %x2FFFE-2FFFF
    / %x40000-4FFFD ;          omit NChar %x3FFFE-3FFFF
    / %x50000-5FFFD ;          omit NChar %x4FFFE-4FFFF
    / %x60000-6FFFD ;          omit NChar %x5FFFE-5FFFF
    / %x70000-7FFFD ;          omit NChar %x6FFFE-6FFFF
    / %x80000-8FFFD ;          omit NChar %x7FFFE-7FFFF
    / %x90000-9FFFD ;          omit NChar %x8FFFE-8FFFF
    / %xA0000-AFFFD ;          omit NChar %x9FFFE-9FFFF
    / %xB0000-BFFFD ;          omit NChar %xAFFFE-AFFFF
    / %xC0000-CFFFD ;          omit NChar %xBFFFE-BFFFF
    / %xD0000-DFFFD ;          omit NChar %xCFFFE-CFFFF
    / %xE0000-EFFFD ;          omit NChar %xDFFFE-DFFFF
                    ;          omit Co %xF0000-FFFFD %x100000-10FFFD, NChar %xEFFFE-EFFFF %xFFFFE-FFFFF %x10FFFE-10FFFF
name-char  = name-start / DIGIT
    / %x2D-2E       ; 【-.】

macchiati avatar Feb 13 '25 00:02 macchiati

This seems reasonable. Parsers seem to do fine with combining marks as long as they ignore the combining class of characters following syntax characters (I give you HTML as evidence that this works). The list in XML Name is, as you note, rather pathetic, especially in hindsight. Guidance not to use dumb identifiers works better than normatively requiring every implementation to check every code point.

(We will also need guidance that not every variable name can be composed in every runtime environment and that users should be careful to use consistent code point sequences/normalization to ensure matching works as intended, cf. https://www.w3.org/TR/charmod-norm)
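
To show what "consistent normalization" means for matching, here is a minimal sketch that compares two names after NFC normalization. Whether and where an implementation normalizes is a separate question that this thread does not settle; `same_name` is a hypothetical helper.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00E9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT
```

Here `composed == decomposed` is false as a raw code point comparison, while `same_name(composed, decomposed)` is true.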

One change I might suggest would be subtracting the bidi formatting characters, notably the isolates, since our syntax permits them in whitespace to make messages that contain bidi visually manageable.

aphillips avatar Feb 13 '25 00:02 aphillips

Good point about bidi controls.

I'd recommend removing all of them (\p{bidi_control})

So:

name-start = ALPHA
    / %x2B          ; 【+】      omit Cc %x0-1F, Whitespace 【 】, Ascii 【!"#$%&'()*】
    / %x5F          ; 【_】      omit Ascii 【,-./0123456789:;<=>?@】 【[\]^】
    / %xA1-61B      ;          omit Cc %x7F-9F, Whitespace %xA0, Ascii 【`】 【{|}~】
    / %x61D-167F    ;          omit BidiControl %x61C
    / %x1681-1FFF   ;          omit Whitespace %x1680
    / %x200B-200D   ;          omit Whitespace %x2000-200A
    / %x2010-2027   ;          omit BidiControl %x200E-200F
    / %x2030-205E   ;          omit Whitespace %x2028-2029 %x202F, BidiControl %x202A-202E
    / %x2060-2065   ;          omit Whitespace %x205F
    / %x206A-2FFF   ;          omit BidiControl %x2066-2069
    / %x3001-D7FF   ;          omit Whitespace %x3000
    / %xF900-FDCF   ;          omit Cs %xD800-DFFF, Co %xE000-F8FF
    / %xFDF0-FFFD   ;          omit NChar %xFDD0-FDEF
    / %x10000-1FFFD ;          omit NChar %xFFFE-FFFF
    / %x20000-2FFFD ;          omit NChar %x1FFFE-1FFFF
    / %x30000-3FFFD ;          omit NChar %x2FFFE-2FFFF
    / %x40000-4FFFD ;          omit NChar %x3FFFE-3FFFF
    / %x50000-5FFFD ;          omit NChar %x4FFFE-4FFFF
    / %x60000-6FFFD ;          omit NChar %x5FFFE-5FFFF
    / %x70000-7FFFD ;          omit NChar %x6FFFE-6FFFF
    / %x80000-8FFFD ;          omit NChar %x7FFFE-7FFFF
    / %x90000-9FFFD ;          omit NChar %x8FFFE-8FFFF
    / %xA0000-AFFFD ;          omit NChar %x9FFFE-9FFFF
    / %xB0000-BFFFD ;          omit NChar %xAFFFE-AFFFF
    / %xC0000-CFFFD ;          omit NChar %xBFFFE-BFFFF
    / %xD0000-DFFFD ;          omit NChar %xCFFFE-CFFFF
    / %xE0000-EFFFD ;          omit NChar %xDFFFE-DFFFF
                    ;          omit Co %xF0000-FFFFD %x100000-10FFFD, NChar %xEFFFE-EFFFF %xFFFFE-FFFFF %x10FFFE-10FFFF


name-char  = name-start / DIGIT
    / %x2D-2E       ; 【-.】    omit Cc %x0-1F, Whitespace 【 】, Ascii 【!"#$%&'()*+,】
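
The bidi subtraction can be reproduced mechanically. The sketch below removes the Bidi_Control code points from the affected ranges of the previous revision; `subtract` is a hypothetical helper, and the Bidi_Control list is the set named in the comments above (U+061C, U+200E-200F, U+202A-202E, U+2066-2069).

```python
BIDI_CONTROL = [0x061C, 0x200E, 0x200F, *range(0x202A, 0x202F), *range(0x2066, 0x206A)]


def subtract(ranges, points):
    """Remove individual code points from a list of inclusive (lo, hi) ranges."""
    removed = set(points)
    result = []
    for lo, hi in ranges:
        start = lo
        for cp in sorted(p for p in removed if lo <= p <= hi):
            if cp > start:
                result.append((start, cp - 1))
            start = cp + 1
        if start <= hi:
            result.append((start, hi))
    return result


# Affected name-start ranges from the previous revision (before removing bidi controls).
BEFORE = [
    (0xA1, 0x167F), (0x1681, 0x1FFF), (0x200B, 0x2027),
    (0x202A, 0x202E), (0x2030, 0x205E), (0x2060, 0x2FFF),
]
AFTER = subtract(BEFORE, BIDI_CONTROL)
```

`AFTER` can then be compared line by line with the `%xA1-61B` through `%x206A-2FFF` alternatives above.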

macchiati avatar Feb 13 '25 01:02 macchiati

As for guidance, I suggest we have something like the following:

Syntactically, the definition of identifier provides backwards compatibility over time by allowing a stable, wide range of characters. So when there is a new character in a version of Unicode, it can be used in any conformant implementation of Message Format.

However, when function implementations and message authors are creating new identifiers (for functions, options, variables, …), it is strongly recommended that they follow at least the following:

  1. the Unicode Default Identifier Syntax
  2. the Unicode General Security Profile for Identifiers

macchiati avatar Feb 13 '25 01:02 macchiati

Removing resolve-candidate, since obviously we're still talking about it 😄

aphillips avatar Feb 14 '25 16:02 aphillips