message-format-wg
message-format-wg copied to clipboard
Fallbacks for variables in syntactically invalid declarations
Consider the following syntactically incorrect message:
.local $bar {|foo|} {{{$bar}}}
This is incorrect because the "=" between $bar and {|foo|} is missing.
The spec requires the implementation to signal a syntax error, but also provide a partial result.
Should the result of formatting be foo since the right-hand side of the declaration is syntactically correct, although the declaration itself isn't? Or should it be {$bar} because no valid declaration for $bar exists?
There is a similar test case in https://github.com/unicode-org/message-format-wg/blob/main/test/syntax-errors.json#L50 , but this file doesn't specify formatting output for syntactically invalid messages.
The basic issue is that this text from formatting.md:
In a declaration, the resolved value of the expression is bound to a variable, which is available for use by later expressions. Since a variable can be referenced in different ways later, implementations SHOULD NOT immediately fully format the value for output.
doesn't say whether to bind a variable to a resolved value if an error is encountered while processing the declaration.
I'll note that a previous version of the messageformat package included a test file that included the following test:
{
"src": "let $bar {|foo|} {{$bar}}",
"exp": "foo",
"errors": [{ "type": "missing-char" }]
},
which implies that $bar should be bound to a value in the environment even though processing its declaration should cause the formatter to emit a syntax error. However, I don't see a basis for this (or the other interpretation) in the spec.
It's a syntax error. We say this about syntax errors:
Syntax Errors and Data Model Errors apply to all message processors, and MUST be emitted as soon as possible. The other error categories are only emitted during formatting, but it might be possible to detect them with validation tools.
You'll also find buried in the formatting spec this instruction:
If the message being formatted has any Syntax Errors or Data Model Errors, the result of pattern selection MUST be a pattern resolving to a single fallback value using the message's fallback string defined in the formatting context or if this is not available or empty, the U+FFFD REPLACEMENT CHARACTER �.
That is, in this case, we emit the logo. While a human might intuit that the rest of the example message is intact and it's only missing an equals sign, there's no way for the parser to know for sure. It's also not possible for the parser to know if other Very Bad Things might happen if $bar were given some fallback value.
I do find that it's confusing that the above quoted paragraph (the "If the message..." one) is not more prominent.
You'll also find buried in the formatting spec this instruction:
Since that text is under the "Pattern Selection" header, it could be read as only applying to a selectors message. (It could also be read the other way, since selecting a pattern from a set containing 1 pattern is also pattern selection, but I think there's no obvious reading of it.)
I agree. A syntax error can occur in a simple message:
This has a {$syntax :error
... so we should clarify that the quoted paragraph applies to all syntax problems.
Probably the right prominent place for this is the formatting intro, where we need to replace this MAY with MUST: https://github.com/unicode-org/message-format-wg/blob/38fdd698e425e165047ec7d3a53e03459a8d5cb5/spec/formatting.md?plain=1#L10-L11
... or we could just get rid of "MAY be" (or convert to SHOULD).
Some syntax errors are recoverable (at least to the point of not requiring the logo be emitted for the whole message)--if the error is restricted to the inside of an expression, the expression would fall back without escaping to obliterate the message. With lazy evaluation, the message as a whole might even work for some inputs. It's probably better to blow up so that the message gets fixed in pre-prod, but we don't have to require it. Should we allow that?
Anyway, should we attempt to clarify pre-preview?
@catamorphism do you agree that this is the right behavior?
@catamorphism do you agree that this is the right behavior?
I'm not sure what "this" is :)
I can read the current spec in several different ways (let's just focus on syntax errors, for simplicity):
- Any syntax error anywhere in the message means the whole message should format to the single fallback value.
- Same as (1), except that a syntax error encountered while parsing an expression MAY be recovered from (by using a fallback for the syntactically incorrect expression, but not for the entire message).
- Same as (2), but with MUST instead of MAY.
- (1) but allowing lazy evaluation: syntax errors in expressions, declarations or patterns that are not needed for the formatting of the chosen pattern MAY be ignored.
- (2) but allowing lazy evaluation: "ignored if the expression's value is never demanded" instead of "recovered from"
(5) is what allows an implementation to make the set of error-free messages as large as possible.
Sorry I wasn't clear. The proposal is that syntax/data model errors fail the message outright.
I agree that expression-internal syntax errors are sometimes recoverable and that we can allow these to be evaluated lazily (only parsed when needed) or that we can allow expression internal error to fallback appropriately or do something implementation defined. This sort of fallback is described in the expression resolution section, btw.
(5) is what allows an implementation to make the set of error-free messages as large as possible.
That isn't quite accurate. It allows an implementation to make the set of error-free formatted messages as large as possible, at the cost of potentially hiding errors in a way that can appear to be "transient". I think it is worth careful consideration, since most parsers will run over the whole message, not skipping patterns until needed. We have a separate error (Invalid Expression) for problems inside of an expression.
I agree that hiding errors isn't necessarily a good thing! Option (1) is the most consistent with all programming languages I'm aware of (all syntax errors fail the program). And I personally support it; parsing the syntax is complicated enough without the requirement to recover from syntax errors. If there are good error diagnostics, I don't think providing partial output adds much.
(That being said, ICU4C makes it hard to give good error diagnostics beyond line and column numbers, and adding better diagnostics is unlikely to happen for the tech preview.)
I think this has been addressed by subsequent spec changes to the effect that syntax errors are non-recoverable.