message-format-wg

Well-formed vs valid

Open macchiati opened this issue 1 year ago • 13 comments

Added text 2024-11-24


I think we need to be careful about our usage of the terms 'well-formed' and 'valid'. The following is not fully fleshed out; it is more of a discussion of the issue and some ideas for the future.

We often reference other sources for identifiers, and want them to be interpreted according to that source. Sources that change over time should (and typically do) distinguish between well-formed and valid. For example, 'ge:manic' is not a well-formed locale identifier, and 'de-Flub' is not a valid locale identifier. However, 'de-Flub' could (conceivably) become valid in the future, if a script is given the code 'Flub'. Good sources also never remove identifiers, or make material changes in the meaning, but may deprecate them: those are still treated as valid.
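To make the distinction concrete, here is a minimal sketch in Python. It assumes a deliberately simplified shape for locale identifiers (language, optional script, optional region) rather than the full BCP 47 grammar, and a tiny hand-written set of subtags standing in for one version of the real registry. The names `classify`, `KNOWN_LANGUAGES`, `KNOWN_SCRIPTS`, and `KNOWN_REGIONS` are illustrative only.

```python
import re

# Simplified shape: language[-Script][-REGION]. The real grammar is richer.
WELL_FORMED = re.compile(
    r"(?P<lang>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)

# Hypothetical stand-ins for one version of the subtag registry.
KNOWN_LANGUAGES = {"de", "en", "def"}
KNOWN_SCRIPTS = {"latn", "cyrl"}
KNOWN_REGIONS = {"de", "us", "419"}


def classify(tag: str) -> str:
    """Return 'ill-formed', 'invalid', or 'valid' for a locale identifier."""
    match = WELL_FORMED.fullmatch(tag)
    if match is None:
        # Syntax error: no future registry version can make this valid.
        return "ill-formed"
    lang = match.group("lang").lower()
    script = match.group("script")
    region = match.group("region")
    if lang not in KNOWN_LANGUAGES:
        return "invalid"
    if script is not None and script.lower() not in KNOWN_SCRIPTS:
        # Well-formed, so it could become valid if the code is registered later.
        return "invalid"
    if region is not None and region.lower() not in KNOWN_REGIONS:
        return "invalid"
    return "valid"
```

With these assumed sets, `classify("ge:manic")` reports ill-formed, while `classify("de-Flub")` reports invalid: the second answer depends on the registry version, the first does not.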

When we reference such sources in message format, such as with option values, we have a few goals.

  • Ideally, implementations would accept only identifiers that are well-formed and valid, and would interpret them only according to the source semantics. For example, interpret 'de' as German and not as Dezfuli.
  • However, we don't want to force implementations to break if they don't support all the identifiers, nor if they don't support the latest version, or if they support an identifier that has become deprecated.

This is also true for our own enums. We have in registry.md:

Implementations MAY accept additional option values for options defined here. However, such values might become defined with a different meaning in the future, including with a different, incompatible name or using an incompatible value space. Supporting implementation-specific option values for standard or optional functions is NOT RECOMMENDED.

We also have BNF:

option = identifier o "=" o (literal / variable)

The implication is that a conformant implementation can interpret any of:

{$x :currency compactDisplay=short}
{$x :currency compactDisplay=medium}
{$x :currency compactDisplay=μικρός}
{$x :currency compactDisplay=|🐭|}
{$x :currency compactDisplay=$myDisplay}

It can also interpret:

{$x :currency currency=CAD}
{$x :currency currency=MyCurrency}
{$x :currency currency=δολάριοΚαναδά}
{$x :currency currency=|¥|}
{$x :currency currency=|🐭|}
{$x :currency currency=$myCurrency}

It could also interpret compactDisplay=short by formatting a long form, and compactDisplay=long by formatting a short form. Or a value of CAD as being GBP, etc.

This level of freedom seems counterproductive for interoperability.


So I propose a general rule, something like the following, for cases where option values are defined by reference to an external source:

  • An implementation MUST ignore any option with a literal option value that is ill-formed according to its external source, and signal that error. This allows linters and message builders to catch ill-formed values early.
    • [It must ignore the option locale=|ge:manic|]
  • An implementation MUST ignore any option with an option value that isn't valid according to any version of the external source.
    • [At the time of this writing, must ignore locale=|dab|]
  • An implementation SHOULD (but need not) ignore an option with an option value that is valid according to some version of the external source.
    • [An implementation might not support Dezfuli, and thus ignore locale=|def|; it may also ignore all deprecated language identifiers, and thus ignore locale=|daf|.]
  • If an implementation doesn't ignore an option, then it MUST interpret its option value in accordance with some version of the source.
    • [It must not interpret 'de' as Dezfuli, or 'def' as German.]

Ignore means that the expression is interpreted as if the option were not there. (I won't talk here about what signals to the caller are associated with that.)
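The rules above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: `OptionSpec`, `resolve_options`, and the error labels are hypothetical names, the three predicates stand in for whatever the external source defines, and every option value is treated as a literal. Recording an error for invalid and unsupported values is also an assumption, since the signaling question is left open here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class OptionSpec:
    """Hypothetical description of one option backed by an external source."""
    is_well_formed: Callable[[str], bool]  # syntax check against the source
    is_valid: Callable[[str], bool]        # valid in some version of the source
    is_supported: Callable[[str], bool]    # this implementation can honor it


def resolve_options(
    options: Dict[str, str], specs: Dict[str, OptionSpec]
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Return (kept options, recorded errors) following the proposed rules."""
    kept: Dict[str, str] = {}
    errors: List[Tuple[str, str]] = []
    for name, value in options.items():
        spec = specs.get(name)
        if spec is None:
            # Options with no external source are outside this sketch.
            kept[name] = value
        elif not spec.is_well_formed(value):
            errors.append((name, "ill-formed"))   # rule 1: ignore and signal
        elif not spec.is_valid(value):
            errors.append((name, "invalid"))      # rule 2: ignore
        elif not spec.is_supported(value):
            errors.append((name, "unsupported"))  # rule 3: permitted to ignore
        else:
            # Rule 4: a kept value must be interpreted per the source.
            kept[name] = value
    return kept, errors
```

An ignored option is simply absent from the returned dictionary, which matches "interpreted as if the option were not there".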


I think we could apply that to our standard enum option values, such as the following in https://github.com/unicode-org/message-format-wg/blob/main/spec/registry.md#options-1, so that |@!$| could be recognized as ill-formed.

  • useGrouping
    • auto (default)
    • always
    • never
    • min2

That is, perhaps we can have a rule in the registry for our functions, something like: the default well-formedness criteria for standard function option values match the constraints on function option identifiers in README.md. Thus |$abc| would be ill-formed for useGrouping. Any function option that had different criteria for well-formedness of its values would simply have an explicit well-formedness statement.
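A rough sketch of that default rule in Python, assuming an ASCII-only approximation of the identifier syntax (the real name grammar admits many non-ASCII characters). `IDENTIFIER_LIKE` and `check_enum_value` are hypothetical names, and the value passed in is the literal's content without the surrounding pipes.

```python
import re

# Simplified, ASCII-only stand-in for the identifier syntax (an assumption).
IDENTIFIER_LIKE = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

# The useGrouping values listed above.
USE_GROUPING_VALUES = {"auto", "always", "never", "min2"}


def check_enum_value(value: str, allowed: set) -> str:
    """Classify an enum option value as 'ill-formed', 'invalid', or 'valid'."""
    if IDENTIFIER_LIKE.fullmatch(value) is None:
        return "ill-formed"  # could never be added to the enum
    if value not in allowed:
        return "invalid"     # well-formed, so a future version might define it
    return "valid"
```

Under this approximation, `@!$` and `$abc` are ill-formed for useGrouping, while a made-up value such as `sometimes` is well-formed but invalid.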


macchiati avatar Nov 14 '24 17:11 macchiati

A few notes:

  • be careful not to confuse "implementation" and "function handler". Many option values are handled by the function handler and not necessarily by the MF2 implementation itself. The MF2 implementation frequently does not validate the option values, even though the spec defines what a well-formed option contains.
  • we don't "MUST ignore" options whose values are ill-formed for some of the reference sources because we allow for implementation defined values. For example, MSFT might allow LCIDs as a value for u:locale=$lcid while ORCL might permit legacy ids like AMERICAN_AMERICA 😱 . Obviously, for non-implementation-defined extensions, the result SHOULD be ignored.

Note that we have text about option resolution in the spec which does indeed drop bad options on the floor. But for options whose interpretation is inside the function handler, the dropping-on-the-floor part is up to the function itself. This is why there is a resolved options section in each function: it defines which options are visible downstream (functions don't currently eat any of their options)

Are there specific changes you want in the spec? I'd advise a careful look at u-namespace.md and registry.md as well as option resolution in formatting.md.

aphillips avatar Nov 14 '24 17:11 aphillips

I was struck by the fact that we are requiring valid for some identifiers (eg timezones), but only well-formed for currencies. Those feel like very similar cases, so if well-formed is right for currency, that term should also be right for timezones (or the inverse).

we don't "MUST ignore" options whose values are ill-formed for some of the reference sources because we allow for implementation defined values

But a straightforward reading of registry.md means that we don't allow that in many cases: whenever we say well-formed (as for currencies) or valid, as in:

timeZone (default is system default time zone or UTC)

But that means I can't use implementation-defined identifiers like "$California Time"

macchiati avatar Nov 14 '24 18:11 macchiati

No, you're correct about this. We should be well-formed for acceptance but permit checking for validity. And we should fix values to permit implementation-specific gorp (mainly for platform-specific values that aren't the sanctioned identifiers)

aphillips avatar Nov 14 '24 18:11 aphillips

I think what we should do is: merge #911 and #922 and then do a cleanup edit on registry.md in a new PR

aphillips avatar Nov 14 '24 18:11 aphillips

makes perfect sense

macchiati avatar Nov 14 '24 22:11 macchiati

Mark (@macchiati) wrote:

  • An implementation MUST ignore any option with an option value that is ill-formed according to its source.

    • [It must ignore the option locale=|ge:manic|]
  • An implementation MUST ignore any option with an option value that isn't valid according to any version of the source. [At the time of this writing, must ignore locale=|dab|]

  • An implementation SHOULD (but need not) ignore an option with an option value that is valid according to some version of the source. [An implementation might not support Dezfuli, and thus ignore locale=|def|; it may also ignore all deprecated language identifiers, and thus ignore locale=|daf|.]

I think the SHOULD in this paragraph should be a MAY, for obvious reasons.

duerst avatar Nov 18 '24 02:11 duerst

This was discussed in the 2024-11-18 call. We resolved to use valid in most cases, but with careful phrasing in the boilerplate. I believe this is now addressed?

aphillips avatar Nov 20 '24 19:11 aphillips

I elaborated a bit. I would like to discuss further, after 46.1

macchiati avatar Nov 22 '24 22:11 macchiati

I see your elaboration. One callout:

"implementation" has to be used carefully. In most cases in our spec it refers to the MessageFormat framework/executable/host environment itself, e.g. in ICU4J the actual MessageFormatter class. And it is true that the ABNF and well-formed/validity rules at the message level are quite permissive about option values.

At the function set level, there is a different layer of "implementation", specifically what we call the function handler. This is what a lot of the normative language in the current registry.md is about. In general, the function handler is some code that maps option values to local API-specific representations. So for "digit size options", it parses the option value. If it's a positive integer, great. Otherwise it's not valid.

We definitely want to impose standards on options and their values, to ensure interoperability. But the MF2-level implementation has no role in this (once the message is syntactically correct). Instead, it is the specific function handler, such as the one for :integer, that is involved. Thus the wording needs to be precise about where the "implementation" is taking place. And it needs to not impose such restrictions as would limit extensibility or prevent the correct level in the code from receiving the information.
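As a sketch of what I mean by the handler layer (Python, illustrative only): the message-level code hands the handler raw option strings untouched, and the handler decides what is usable. The function names, the `bad-option` label, and the choice of `minimumIntegerDigits` as the example option are assumptions made for this sketch, not quotations from the spec.

```python
from typing import Dict, List, Optional, Tuple


def parse_digit_size(value: str) -> Optional[int]:
    """Return the number for a positive decimal integer string, else None."""
    if value.isascii() and value.isdigit() and int(value) > 0:
        return int(value)
    return None


def integer_handler(
    raw_options: Dict[str, str]
) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Hypothetical handler: maps raw option strings to local API values."""
    resolved: Dict[str, int] = {}
    errors: List[Tuple[str, str]] = []
    # Only one digit size option is handled, to keep the sketch short.
    for name in ("minimumIntegerDigits",):
        if name in raw_options:
            parsed = parse_digit_size(raw_options[name])
            if parsed is None:
                errors.append((name, "bad-option"))  # the handler's decision
            else:
                resolved[name] = parsed
    return resolved, errors
```

The point is that nothing above the handler needs to know what a digit size is; it only has to deliver the option value to the layer that does.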

aphillips avatar Nov 22 '24 23:11 aphillips

I agree that there are important distinctions to be made, and in any final text we should make it clear. What I'm specifically talking about are the implementations of the standard functions defined in the registry.md. Whatever we do, it should be clear what kinds of results we can expect to have, and what kinds of errors we can expect to see raised (which might be different for ill-formed vs well-formed+invalid vs well-formed+valid+unsupported vs well-formed+valid+supported).

Some of that could apply to implementation-defined functions, but I didn't want to talk about that in this issue.

macchiati avatar Nov 22 '24 23:11 macchiati

I would still like to discuss this further. For the standard functions, we do not do interoperability any favors by not differentiating between the following. As the user, I would want a linter or precompiler to know that:

  1. none of them are valid
  2. the first 2 are well-formed (and could be valid in a future version)
  3. the last 2 are ill-formed (and couldn't be valid in a future version).

{$x :currency compactDisplay=short}
{$x :currency compactDisplay=medium}
{$x :currency compactDisplay=μικρός}
{$x :currency compactDisplay=|🐭|}

For reference, the listed values are:

  • currencyDisplay
    • narrowSymbol
    • symbol (default)
    • name
    • code
    • formalSymbol
    • never (this is called hidden in ICU)

So I think we should make the validity/well-formedness distinction for all the standard functions, and recommend it for non-standard functions.

macchiati avatar Dec 09 '24 19:12 macchiati

@macchiati asked for Agenda+ on this item via email.

aphillips avatar Jan 10 '25 21:01 aphillips

In the 2025-02-03 call we agreed that this feature depends on developing machine-readable function definitions, a deliverable that is not in v47 scope. I am removing Agenda+ and moving to v48 as a result. Please add Agenda+ when this should be discussed again (it will not appear in the 2025-02-17 agenda)

aphillips avatar Feb 14 '25 17:02 aphillips