Collapse all escape sequence rules into one

Open eemeli opened this issue 1 year ago • 9 comments

Currently, we are very strict about which characters may be escaped, and where. This means that in the syntax we have a separate escape rule for each context: https://github.com/unicode-org/message-format-wg/blob/6d7b4ba213e686ff2d403d3025d38d76b42b75f7/spec/message.abnf#L103-L105

As discussed in #635 and during the 2024-03-18 call, this could be simplified by allowing each of the characters \, {, |, } to be escaped in all the positions we allow for any of them to be escaped. Doing so would simplify the syntax, make escaping easier to understand for users, and simplify implementations.

This relaxation would come with the small cost of making the messages this|that and this\|that synonymous, much like we already allow for hello and {{hello}} to be synonymous.
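To illustrate the equivalence described above, here is a minimal sketch of what a single, context-independent unescape step might look like. The names `ESCAPABLE` and `unescapeUnified` are hypothetical and not taken from the spec or any implementation; the sketch only handles already-delimited text and ignores placeholders entirely.

```javascript
// Hypothetical helper: one escape rule shared by every context.
// Under the proposal, a backslash may precede any of \ { | }.
const ESCAPABLE = new Set(['\\', '{', '|', '}']);

function unescapeUnified(source) {
  let out = '';
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\') {
      const next = source[i + 1];
      // A trailing backslash or any other escaped character is rejected.
      if (!ESCAPABLE.has(next)) {
        throw new SyntaxError(`Invalid escape at index ${i}`);
      }
      out += next;
      i++; // skip the escaped character
    } else {
      out += ch;
    }
  }
  return out;
}

// Both spellings resolve to the same text.
console.log(unescapeUnified('this|that') === unescapeUnified('this\\|that')); // true
```

Note that the JavaScript string literal `'this\\|that'` contains a single backslash, so it corresponds to the message source this\|that.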

This PR is not intended for consideration for the LDML 45 release of MF2, but after that.

eemeli avatar Mar 23 '24 10:03 eemeli

I think there would need to be spec text about this as well as the ABNF mods.

aphillips avatar Mar 23 '24 15:03 aphillips

If we made this change, we'd also need to change lines 87-88 in the ABNF to exclude |

Current:

; Restrictions on characters in various contexts
simple-start-char = content-char / s / "@" / "|"
text-char         = content-char / s / "." / "@" / "|"

Replace with:

; Restrictions on characters in various contexts
simple-start-char = content-char / s / "@"
text-char         = content-char / s / "." / "@"

aphillips avatar Mar 23 '24 16:03 aphillips

> I think there would need to be spec text about this as well as the ABNF mods.

That's included; the only new thing that's required is this addition: https://github.com/unicode-org/message-format-wg/blob/df210399f85b9e671ad3ac0fa70d1c255330f2bc/spec/syntax.md?plain=1#L917

That's because we already have this in Literal Resolution: https://github.com/unicode-org/message-format-wg/blob/df210399f85b9e671ad3ac0fa70d1c255330f2bc/spec/formatting.md?plain=1#L171-L173

> If we made this change, we'd also need to change lines 87-88 in the ABNF to exclude |

We don't need to do that; as I mention in the first comment above, | can still be allowed in patterns, so the messages this|that and this\|that are synonymous. We don't need to make the first variant invalid.

eemeli avatar Mar 23 '24 17:03 eemeli

> This relaxation would come with the small cost of making the messages this|that and this\|that synonymous, much like we already allow for hello and {{hello}} to be synonymous.

It would also mean that a |{foo}| literal could be spelled as |\{foo\}|.


> Doing so would simplify the syntax, make escaping easier to understand for users, and simplify implementations.

Overall, I'm leaning against this change. It's much easier for me personally to remember which characters need escaping by recalling what the delimiters are. There's a clear one-to-one correspondence in the current design: In {{patterns}} you escape \{ and \}, and in |literals| you escape \|.
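The one-to-one correspondence described above could be sketched as a per-context lookup. This is only an illustration: `ESCAPES_BY_CONTEXT` and `isEscapeAllowed` are hypothetical names, and treating the backslash itself as escapable in both contexts is an assumption based on the list of characters in the opening comment.

```javascript
// Assumed mapping: each context escapes only its own delimiters.
// The backslash is assumed to be escapable in every context.
const ESCAPES_BY_CONTEXT = {
  pattern: new Set(['\\', '{', '}']), // inside {{patterns}}
  literal: new Set(['\\', '|']),      // inside |literals|
};

function isEscapeAllowed(context, ch) {
  const allowed = ESCAPES_BY_CONTEXT[context];
  if (!allowed) throw new RangeError(`Unknown context: ${context}`);
  return allowed.has(ch);
}

console.log(isEscapeAllowed('pattern', '{')); // true
console.log(isEscapeAllowed('pattern', '|')); // false
console.log(isEscapeAllowed('literal', '|')); // true
```

Under the proposed change, both sets would presumably become the same four characters, which is what makes multiple spellings possible.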

With the proposed change, there are now multiple spellings of the same content. It may be simple to write for some, but it will also confuse readers who don't know the specifics of the syntax by heart.


Furthermore, we invested a lot of effort to avoid backslashes as much as we could, because we target multiple different host formats in which the backslash must itself be escaped (with another backslash). I don't think the convenience of parser implementors should have a higher priority.
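As a small illustration of the host-format concern, here is what happens when a message containing one backslash is embedded in JSON. The `greeting` key is a made-up example; `JSON.stringify` is the standard built-in.

```javascript
// The MF2 source text is: this\|that (one backslash).
// The JavaScript string literal already needs the backslash doubled.
const message = 'this\\|that';

// Embedding the message in JSON doubles the backslash again on disk.
const embedded = JSON.stringify({ greeting: message });
console.log(embedded); // {"greeting":"this\\|that"}

// Parsing restores the single-backslash message.
console.log(JSON.parse(embedded).greeting === message); // true
```

Every optional escape a message author adds therefore costs an extra backslash in a host format like this one.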

stasm avatar Mar 25 '24 16:03 stasm

> Overall, I'm leaning against this change. It's much easier for me personally to remember which characters need escaping by recalling what the delimiters are.

I'm kind of in the same boat.

The exception is the reserved body syntax, which seems very clunky and has its own escape rule. Not only that, but the reserved body can contain |...| literals. It has no delimiters of its own, so what needs escaping depends on context, and that context can go several levels deep: a |...| inside a reserved body (reserved because it comes after a .fooo), inside a placeholder, which can be in a pattern, which may or may not be in a {{...}}.

So I would rather (partially) fix this by improving the reserved syntax.

mihnita avatar Mar 25 '24 16:03 mihnita

Does this change simply allow these characters to be escaped without effect, or does it require escaping them?

bearfriend avatar Mar 25 '24 17:03 bearfriend

It's the former: this|that is also allowed to be spelled as this\|that.

The vertical bar | remains a valid character inside patterns, and the curly braces {, } remain valid characters inside literals:

text-char         = content-char / s / "." / "@" / "|"
quoted-char       = content-char / s / "." / "@" / "{" / "}"

stasm avatar Mar 25 '24 17:03 stasm

Right, so in that case it seems simply "recalling what the delimiters are" still works, from a message-writing perspective.

I understand JavaScript certainly won't be the only implementation, but I personally wouldn't expect it to work any differently than:

'\{\|' === '{|'; // true

bearfriend avatar Mar 25 '24 17:03 bearfriend

@eemeli Do you want to clean this up so that we can consider it for merge?

aphillips avatar Apr 15 '24 17:04 aphillips

In the 2024-05-06 call we agreed that we are waiting on a review from @stasm, to be followed by some additional discussion.

aphillips avatar May 12 '24 16:05 aphillips