message-format-wg

Fix whitespace conformance to match UAX31 (including permitting LRM/RLM)

Open aphillips opened this issue 1 year ago • 10 comments

This partially addresses #661 by allowing the LRM character in message whitespace. This is whitespace outside pattern text. Tools can use this to help ensure that messages are formatted visually in a way consistent with LTR presentation of a message.

aphillips avatar Feb 19 '24 18:02 aphillips

@eggrobin For review

aphillips avatar Feb 19 '24 18:02 aphillips

While a departure, I think it is cleaner than before....

On Tue, Feb 20, 2024 at 10:36 AM Addison Phillips @.***> wrote:

@.**** commented on this pull request.

In spec/syntax.md https://github.com/unicode-org/message-format-wg/pull/673#discussion_r1496322779 :

Inside patterns and quoted literals, whitespace is part of the content and is recorded and stored verbatim. Whitespace is not significant outside translatable text, except where required by the syntax.

+There are two whitespace productions in the syntax.
+Optional whitespace is whitespace that is not required by the syntax,
+but which users might want to include to increase the readability of a message.
+Required whitespace is whitespace that is required by the syntax.
+
+Tools SHOULD generate U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK
+characters where permitted by the syntax before or following identifiers,
+unquoted literals, or option values that use right-to-left characters

Note that if there are any issues in the WG about this, we should refrain from these changes until after the v45 release, just leaving a note that we're looking at the bidi ordering issues...

The changes are to the syntax and I think important enough to merit doing the change now--the better to stabilize the syntax. It does represent a relaxation of what is allowed in free whitespace. I would like to avoid having a lot of Tech Preview implementations reject bidi-friendlier messages in the fall.

OTOH, it does represent a departure from how we set up the s production.
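
As a rough illustration of the SHOULD in the proposed text above, a serializer could append LRM after any token that contains strongly right-to-left characters. This is a minimal Python sketch, not anything from the PR: `has_rtl` and `guard_token` are hypothetical names, and treating bidi classes R and AL as "right-to-left characters" is an assumption.

```python
import unicodedata

LRM = "\u200E"  # U+200E LEFT-TO-RIGHT MARK


def has_rtl(token: str) -> bool:
    """True if any character has a strong right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in token)


def guard_token(token: str) -> str:
    """Hypothetical helper: append LRM after a token that contains RTL characters."""
    return token + LRM if has_rtl(token) else token


# ascii() makes the otherwise invisible mark show up as an escape sequence.
print(ascii(guard_token("count")))
print(ascii(guard_token("\u05E9\u05DD")))
```

Tokens made up only of LTR characters pass through unchanged, so existing messages would serialize exactly as before.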

macchiati avatar Feb 20 '24 18:02 macchiati

I'm a little puzzled about the explicit choice made here of only allowing RLM and LRM, as opposed to other directional formatting characters. Why is that preferable here to LRI/RLI/FSI/PDI, which we're exclusively using in our default bidi isolation strategy?

I've not spent very long (yet...) with bidi concerns, but one aspect that I'm concerned with is the implementation and understanding of the recommendations added by this change. To me, isolates seem more in line with the shape of MF2 syntax and its nestings of code and message text. As a programmer, they're also somewhat easier to reason about as their effects are more direct and, well, isolated.

I'm also a bit concerned about the effects that our allowance for RTL names and identifiers may have, esp. when they are mixed in with LTR names and identifiers, and our general allowance for newlines in whitespace.

Rather than allowing directional formatting characters in any/all whitespace, could we be much more restrictive about which characters we allow, and where? And also perhaps include some explicit guidance about the preferred rendering order for syntax, such as the LTR order within expressions of operand > function name > options > attributes?

For instance, would it be appropriate to only allow the following?

  1. LRM after a syntax whitespace newline
  2. LRI/RLI/FSI before the quoted-pattern start {{
  3. PDI after the quoted-pattern end }}
  4. LRI after the expression or markup start {
  5. PDI before the expression or markup end }
  6. LRI/RLI/FSI before variable or literal
  7. LRI/RLI/FSI before the sigil (if any) prefixing an identifier
  8. PDI after variable, literal, or identifier

The intent with the above would be to ensure that it's possible to have a valid message for which the "code" portions always have an LTR paragraph direction, while allowing for all user-customizable strings to define their own direction.
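
To make a few of the items in the list concrete, here is a minimal Python sketch of a serializer that puts LRI after the opening `{` and PDI before the closing `}` (items 4 and 5) and isolates the operand on its own (items 6 and 8). The helper names `isolate` and `expression` are hypothetical, and defaulting to FSI for the operand is an assumption made for the example, not a recommendation from the thread.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate(text: str, opener: str = FSI) -> str:
    """Wrap a user-customizable string (variable, literal, identifier) in an isolate."""
    if opener not in (LRI, RLI, FSI):
        raise ValueError("opener must be LRI, RLI, or FSI")
    return f"{opener}{text}{PDI}"


def expression(operand: str, function: str) -> str:
    """Hypothetical serializer for `{operand :function}`.

    The 'code' part gets an LTR isolate just inside the braces,
    and the operand gets its own isolate so it can define its own direction.
    """
    return "{" + LRI + isolate(operand) + " :" + function + PDI + "}"


# A variable with a Hebrew name; ascii() shows where the controls land.
print(ascii(expression("$\u05E9\u05DD", "string")))
```

Every opener emitted here is closed by a PDI inside the same expression, which is the property that makes the isolate approach easy to reason about.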

eemeli avatar Feb 21 '24 11:02 eemeli

@eemeli That's a good point.

Your list would cover the isolation requirements, albeit being somewhat hard to specify. I would permit the LRM solution also (it's simpler if one is prudent about using RTL tokens in placeholders or keys).

It's a bit hard to specify in the grammar. I think we could add isolates to the various placeholder quotes ({/}) and to pattern quotes ({{/}}) and probably should do.

One part of our syntax that makes me nervous is the key array. Key values can be literals and these are only separated by whitespace. RLI/PDI around keys (or LRM after them) keeps this from happening (numbers added to show the logical order of the keys):

[image not preserved: key array with numbers added to show the logical order of the keys]

It also guards against this kind of "exchange" (the message actually has keys |1| |2 3|):

[image not preserved: rendering of a message whose keys are actually |1| |2 3|]

aphillips avatar Feb 21 '24 15:02 aphillips

It's a bit hard to specify in the grammar.

Generally that is the thrust of the recommendations of UTS55, reflecting the consensus arising from a year and a half of discussion with security experts and implementers at various levels from compilers to editors in the Source Code Working Group[^1]: syntaxes should allow the characters that are needed to fix things, and higher-level tooling should deal with inserting them in the right places, ensuring the stateful controls are correctly paired, etc.

Again I can see how it would make sense to add the isolates to ignorable format controls in this context; but trying to restrict it to the right places is messy to deal with in the grammar, because you have to ensure that they are properly terminated (and note of course that stateful controls that are in literal text would interact with those that you allow in whitespace).
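
The pairing check that would be left to higher-level tooling can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a UTS55 algorithm: `isolates_balanced` is a hypothetical name, the function looks only at isolate controls (LRI, RLI, FSI, PDI), and it ignores the fact that isolates are also closed at paragraph boundaries.

```python
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolates_balanced(text: str) -> bool:
    """True if every isolate opener has a matching PDI and no PDI is stray."""
    depth = 0
    for ch in text:
        if ch in ISOLATE_OPENERS:
            depth += 1
        elif ch == PDI:
            if depth == 0:
                return False  # PDI with nothing to close
            depth -= 1
    return depth == 0  # leftover openers mean an unterminated isolate


print(isolates_balanced("{\u2066$x :number\u2069}"))
print(isolates_balanced("{\u2066$x :number}"))
```

Running a check like this over a whole message, literal text included, is one way a linter could catch controls in literals that interact with controls in whitespace.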

[^1]: The SCWG was a limited-duration working group of the Properties and Algorithms Group of the Unicode Technical Committee, see UTC-170-C2 https://www.unicode.org/L2/L2022/22016.htm#170-C2.

eggrobin avatar Feb 21 '24 16:02 eggrobin

It is quite tricky, and we should not derail the tech preview release for this. That's why I (more strongly) urge that we capture this issue in a note in the spec for tech preview, and make the fix afterwards.

macchiati avatar Feb 21 '24 16:02 macchiati

A short-term fix would be to permit the isolates around whitespace (without attempting to forcibly pair them) and to permit (correct side) decoration of the placeholder and pattern delimiters with LRI/PDI. I'm noodling around with it in another browser tab before proposing it. Otherwise, yes, this should be a Tech Preview item with a note in the specification. I will add the note to this PR also.
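
For a sense of what "permit the isolates around whitespace (without attempting to forcibly pair them)" could look like to a parser, here is a minimal Python sketch using regular expressions. Everything in it is an assumption made for illustration: the whitespace set (space, tab, CR, LF, ideographic space), the set of permitted controls, and the names `OPTIONAL_WS` and `REQUIRED_WS`. It is not the grammar from the PR.

```python
import re

# Assumed whitespace characters: space, tab, CR, LF, ideographic space.
WS = " \t\r\n\u3000"
# Bidi controls discussed in the thread: LRM, RLM, LRI, RLI, FSI, PDI.
BIDI = "\u200E\u200F\u2066\u2067\u2068\u2069"

# Optional whitespace: any mix of whitespace and bidi controls, possibly empty.
OPTIONAL_WS = re.compile(f"[{WS}{BIDI}]*")
# Required whitespace: the same mix, but with at least one real whitespace character.
REQUIRED_WS = re.compile(f"[{BIDI}]*[{WS}][{WS}{BIDI}]*")

print(bool(REQUIRED_WS.fullmatch(" \u200E")))  # space followed by LRM
print(bool(REQUIRED_WS.fullmatch("\u200E")))   # LRM alone, no real whitespace
```

No pairing is enforced here, which matches the "without attempting to forcibly pair them" part: a lone LRI or PDI next to whitespace is accepted, and checking balance is left to separate tooling.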

aphillips avatar Feb 21 '24 16:02 aphillips

Looks great. Will approve once I'm back at my computer

macchiati avatar Feb 21 '24 21:02 macchiati

@mihnita noted:

Looks OK, but incomplete. We should also look at isolates, and it is a bit too close to deadline.

Can you clarify? What is incomplete? Also, this uses isolates, so your last comment is mysterious to me.

I agree that the deadlines are an issue. My concerns here are that (a) we are a Unicode WG and this is a Unicode set of requirements, so even if we don't include it in 45, we need to deal with it; and (b) syntax stability is important to me. Permitting bidi controls now in a somewhat (but not entirely!!) loose manner will prevent unnecessary churn later.

Anyway, to discuss in a few minutes in our call 😉

aphillips avatar Feb 26 '24 17:02 aphillips

WG will consider this post-45. @aphillips to create a 45-timed PR with a note.

aphillips avatar Feb 26 '24 19:02 aphillips

Closing as obsolete

aphillips avatar Sep 11 '24 16:09 aphillips