message-format-wg
Collapse all escape sequence rules into one
Currently, we are very strict about which characters may be escaped, and where. This means that in the syntax we have https://github.com/unicode-org/message-format-wg/blob/6d7b4ba213e686ff2d403d3025d38d76b42b75f7/spec/message.abnf#L103-L105
As discussed in #635 and during the 2024-03-18 call, this could be simplified by allowing each of the characters \, {, |, } to be escaped in all the positions we allow for any of them to be escaped. Doing so would simplify the syntax, make escaping easier to understand for users, and simplify implementations.
This relaxation would come with the small cost of making the messages this|that and this\|that synonymous, much like we already allow for hello and {{hello}} to be synonymous.
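To make the synonymy concrete, here is a minimal sketch of unescaping under the proposed single rule. The helper name `unescapeMF2` is hypothetical and not taken from any implementation. It assumes the input is content that has already been isolated from its delimiters, so an unescaped `{` or `|` is simply passed through.

```javascript
// Hypothetical helper, for illustration only: resolves escape sequences
// under the proposed single rule, where \\, \{, \| and \} are accepted
// in every position that allows escapes.
function unescapeMF2(source) {
  let result = '';
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch !== '\\') {
      result += ch;
      continue;
    }
    const next = source[i + 1];
    if (next === '\\' || next === '{' || next === '|' || next === '}') {
      result += next;
      i++; // skip the escaped character
    } else {
      // Any other character after a backslash (or a trailing backslash)
      // is not a valid escape.
      throw new SyntaxError(`Invalid escape at index ${i}`);
    }
  }
  return result;
}

// Both spellings resolve to the same content.
console.log(unescapeMF2('this|that') === unescapeMF2('this\\|that')); // true
```

Note that the JavaScript string literal `'this\\|that'` holds the message source `this\|that`, since the host language itself requires the backslash to be doubled.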
This PR is not intended for the LDML 45 release of MF2; it should be considered after that release.
I think there would need to be spec text about this as well as the ABNF mods.
If we made this change, we'd also need to change lines 87-88 in the ABNF to exclude |
Current:
; Restrictions on characters in various contexts
simple-start-char = content-char / s / "@" / "|"
text-char = content-char / s / "." / "@" / "|"
Replace with:
; Restrictions on characters in various contexts
simple-start-char = content-char / s / "@"
text-char = content-char / s / "." / "@"
I think there would need to be spec text about this as well as the ABNF mods.
That's included; the only new thing that's required is this addition: https://github.com/unicode-org/message-format-wg/blob/df210399f85b9e671ad3ac0fa70d1c255330f2bc/spec/syntax.md?plain=1#L917
That's because we already have this in Literal Resolution: https://github.com/unicode-org/message-format-wg/blob/df210399f85b9e671ad3ac0fa70d1c255330f2bc/spec/formatting.md?plain=1#L171-L173
If we made this change, we'd also need to change lines 87-88 in the ABNF to exclude |
We don't need to do that; as I mention in the first comment above, | can still be allowed in patterns, so the messages this|that and this\|that are synonymous. We don't need to make the first variant invalid.
This relaxation would come with the small cost of making the messages this|that and this\|that synonymous, much like we already allow for hello and {{hello}} to be synonymous.
It would also mean that a |{foo}| literal could be spelled as |\{foo\}|.
Doing so would simplify the syntax, make escaping easier to understand for users, and simplify implementations.
Overall, I'm leaning against this change. It's much easier for me personally to remember which characters need escaping by recalling what the delimiters are. There's a clear one-to-one correspondence in the current design: In {{patterns}} you escape \{ and \}, and in |literals| you escape \|.
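The one-to-one correspondence described above can be sketched as a small lookup table. This is an assumed summary of the context-specific rules as described in this thread (the backslash itself plus the delimiters of the enclosing context), and it leaves out the reserved body. The names `ESCAPABLE` and `isValidEscape` are hypothetical.

```javascript
// Assumed summary of the current context-specific rules; names are
// illustrative and not taken from the spec or any implementation.
const ESCAPABLE = {
  pattern: new Set(['\\', '{', '}']), // text inside patterns
  literal: new Set(['\\', '|']),      // content of quoted |literals|
};

// Returns true if `char` may follow a backslash in the given context.
function isValidEscape(context, char) {
  return ESCAPABLE[context].has(char);
}

console.log(isValidEscape('pattern', '{')); // true
console.log(isValidEscape('pattern', '|')); // false
console.log(isValidEscape('literal', '|')); // true
console.log(isValidEscape('literal', '{')); // false
```

Under the proposal, both sets would hold the same four characters, so the table would collapse to a single entry.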
With the proposed change, there are now multiple spellings of the same content. It may be simple to write for some, but it will also confuse readers who don't know the specifics of the syntax by heart.
Furthermore, we invested a lot of effort to avoid slashes as much as we could, because we target multiple different host formats in which the backslash must be escaped (with another backslash). I don't think the convenience of parser implementors should have a higher priority.
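The host-format cost is easy to see with JSON, where every backslash in a message must itself be escaped. The snippet below uses only the standard `JSON.stringify` and `JSON.parse`; the message is the example from this thread, and the `msg` key is an arbitrary choice for illustration.

```javascript
// The message source as authored: this\|that
// (the JavaScript string literal already needs a doubled backslash).
const message = 'this\\|that';

// Embedding the message in JSON doubles the backslash again on disk.
const embedded = JSON.stringify({ msg: message });
console.log(embedded); // {"msg":"this\\|that"}

// Parsing restores the original single-backslash source.
console.log(JSON.parse(embedded).msg === message); // true
```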
Overall, I'm leaning against this change. It's much easier for me personally to remember which characters need escaping by recalling what the delimiters are
I'm kind of in the same boat.
Except for the reserved body syntax, which seems very clunky and has its own escape.
Not only that, but the reserved body can contain |...| escapes.
It does not have delimiters of its own; it depends on context.
And the context can go several levels deep: a |...| inside a reserved body (because it follows a .fooo), inside a placeholder, which can be inside a pattern, which may or may not be wrapped in {{...}}.
So I would rather (partially) fix this by improving the reserved syntax.
Does this change simply allow these characters to be escaped without effect or require it?
It's the former: this|that is also allowed to be spelled as this\|that.
The vertical bar | remains a valid character inside patterns, and the curly braces {, } remain valid characters inside literals:
text-char = content-char / s / "." / "@" / "|"
quoted-char = content-char / s / "." / "@" / "{" / "}"
Right, so in that case it seems simply "recalling what the delimiters are" still works, from a message-writing perspective.
I understand JavaScript certainly won't be the only implementation, but I personally wouldn't expect it to work any differently than:
'\{\|' === '{|'; // true
@eemeli Do you want to clean this up so that we can consider it for merge?
In the 2024-05-06 call, we agreed that we were waiting on a review from @stasm, followed by some additional discussion.