message-format-wg formatToParts-like Iterator

Is your feature request related to a problem? Please describe. Rather than format a message to a string, it's sometimes useful to work with an iterable of formatted parts. This is conceptually similar to NumberFormat.prototype.formatToParts and others. This approach allows greater flexibility on the side of the calling code, and opens up opportunities to design tighter integrations with other libraries.

For instance, in the message Hello, {$userName}!, the userName variable could be a UI widget, e.g. a React component. The code integrating the localization into the React app could then call formatToParts which would yield the React component unstringified, ready to be used in render().

Another example: It is {$currentTime} where $currentTime is a Date object formatted as 18:03. If formatToParts yields it unformatted, the calling code can then call DateTimeFormat.prototype.formatToParts on it and add custom markup to achieve effects like 18:03 (hours in bold).
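A minimal sketch of this second example, assuming the calling code receives the raw Date and formats it itself. `boldHours` is a hypothetical helper; the `<b>` markup, the `en-GB` locale and the UTC time zone are choices made only for illustration.

```typescript
// Hypothetical helper: format a Date to parts and wrap the hour in <b>.
function boldHours(date: Date, locale: string): string {
  const dtf = new Intl.DateTimeFormat(locale, {
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
    timeZone: "UTC", // fixed zone so the example does not depend on the host
  });
  return dtf
    .formatToParts(date)
    .map((part) => (part.type === "hour" ? `<b>${part.value}</b>` : part.value))
    .join("");
}

// 18:03 UTC; months are zero-based, so 1 is February.
const currentTime = new Date(Date.UTC(2020, 1, 11, 18, 3));
const timeMessage = `It is ${boldHours(currentTime, "en-GB")}`;
console.log(timeMessage);
```

The message formatter itself never needs to know about the markup; it only has to hand over the Date unformatted.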

Describe the solution you'd like The formatToParts iterator would yield parts after all selection logic runs (in MF terms, that's after plural and select choose one of the supplied variants) and after the variables are resolved to their runtime values, but before ~~variables are resolved to their values,~~ they are formatted to strings and interpolated into the message.

Describe why your solution should shape the standard It's an API offering a lot of flexibility to its consumers. The regular format API returning a string can be implemented on top of it, too.
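To make the idea concrete, here is a rough sketch of a parts iterator with the string-returning API layered on top. The `PatternElement` and `Part` shapes are assumptions invented for this example, not anything defined by MF or Fluent, and selection is assumed to have already happened.

```typescript
// Hypothetical, simplified model of an already-selected pattern:
// literal text interleaved with variable references.
type PatternElement =
  | { kind: "text"; value: string }
  | { kind: "variable"; name: string };

type Part =
  | { type: "literal"; value: string }
  | { type: "dynamic"; value: unknown };

// Yields parts after variables are resolved to their runtime values,
// but before those values are formatted to strings.
function* formatToParts(
  pattern: PatternElement[],
  args: Record<string, unknown>,
): Generator<Part> {
  for (const element of pattern) {
    if (element.kind === "text") {
      yield { type: "literal", value: element.value };
    } else {
      yield { type: "dynamic", value: args[element.name] };
    }
  }
}

// The regular string-returning API, implemented on top of the iterator.
function format(pattern: PatternElement[], args: Record<string, unknown>): string {
  let result = "";
  for (const part of formatToParts(pattern, args)) {
    result += String(part.value);
  }
  return result;
}

const hello: PatternElement[] = [
  { kind: "text", value: "Hello, " },
  { kind: "variable", name: "userName" },
  { kind: "text", value: "!" },
];

const widget = { component: "UserBadge" }; // stands in for a UI widget
const helloParts = Array.from(formatToParts(hello, { userName: widget }));
const helloText = format(hello, { userName: "Anne" });
```

Here `helloParts[1].value` is the very same `widget` object that was passed in, which is what lets a UI library render it directly.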

Additional context or examples Fluent is now considering formatToParts in https://github.com/projectfluent/fluent.js/issues/383 and https://github.com/projectfluent/fluent/issues/273. I expect it to be ready by the end of H1. We see it as a great way of allowing interesting use-cases like component interpolation mentioned above, as well as an alternative approach to handle BiDi isolation (see #28) and to support custom transformation functions for text (great for implementing pseudolocalizations).
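As an illustration of the text-transformation use case, a sketch that pseudolocalizes only the literal parts and leaves runtime values untouched. The part shape and the accent table are assumptions made for this example; this is not how Fluent implements it.

```typescript
type TextPart =
  | { type: "literal"; value: string }
  | { type: "dynamic"; value: unknown };

// Hypothetical transformation: accent a few ASCII vowels.
const ACCENTED: Record<string, string> = { a: "á", e: "é", i: "í", o: "ó", u: "ú" };

function pseudolocalize(text: string): string {
  return text.replace(/[aeiou]/g, (ch) => ACCENTED[ch] ?? ch);
}

// Apply a text transformation to literal parts only.
function transformLiterals(
  parts: TextPart[],
  transform: (text: string) => string,
): TextPart[] {
  return parts.map((part): TextPart =>
    part.type === "literal"
      ? { type: "literal", value: transform(part.value) }
      : part,
  );
}

const pseudoParts = transformLiterals(
  [
    { type: "literal", value: "Hello, " },
    { type: "dynamic", value: "Anne" },
    { type: "literal", value: "!" },
  ],
  pseudolocalize,
);
const pseudoText = pseudoParts.map((part) => String(part.value)).join("");
```

The user name stays as "Anne" because only text authored in the message is transformed.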

Feb 11 '20 17:02 stasm

+100

Feb 15 '20 19:02 mihnita

@stasm How would the "before variables are resolved to their values" bit work? I definitely agree with having parts including the value before its stringified, but what would be the benefit of not already determining the part's original value?

This matters when you consider format-to-parts output together with post-resolution transformations that might enable solutions to #16, #31, #34, and #160. If the parts are emitted before variable resolution, such a transformation could not be applied to them.

May 02 '21 12:05 eemeli

Here's a concrete proposal for the interfaces of the formatted message parts, where the return type of a formatToParts() method or function would be FormattedPart[]:

interface FormattedDynamic<T> {
  type: 'dynamic';
  value: T;
  meta?: FormattedMeta;
  toString(): string;
  valueOf(): T;
}

interface FormattedFallback {
  type: 'fallback';
  value: string;
  meta?: FormattedMeta;
  toString(): string;
  valueOf(): string;
}

interface FormattedLiteral {
  type: 'literal';
  value: string | number;
  meta?: FormattedMeta;
  toString(): string;
  valueOf(): string | number;
}

interface FormattedMessage<T> {
  type: 'message';
  value: FormattedPart<T>[];
  meta?: FormattedMeta;
  toString(): string;
  valueOf(): string;
}

type FormattedMeta = Record<string, string | number | boolean | null>;

type FormattedPart<T = unknown> =
  | FormattedDynamic<T>
  | FormattedFallback
  | FormattedMessage<T>
  | (T extends string | number ? FormattedLiteral : never);

In other words, I'm proposing that we have four different formatted parts:

literal strings and numbers are directly defined in the source message that's being formatted.
dynamic values are those returned by custom formatting functions, or from runtime variables. Their types are not defined or forced by the spec.
message is a resolved message reference, containing a list of message parts.
fallback is used when errors occur, e.g. a formatting function fails for some reason. Its value should be a syntax-like representation of the expression that caused the error.

The fields of these parts are shared by all, and each has an important role:

type identifies what sort of part this is
value is the raw value of the part
meta, if defined, holds metadata about the value. For example, it may contain an identifier for its gender, or an error_message for a fallback part.
toString() is used to stringify each part in a locale-appropriate manner. For message this means concatenating its stringified parts; for fallback the value is wrapped as '{value}'.
valueOf() provides a singular representation of the part's value, for use without the other information included in the part. For dynamic and literal that is the part's raw value, but for message and fallback the toString() output is returned.
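A small sketch of how these parts might be constructed, to show the toString() and valueOf() rules in action. The factory functions (`literalPart`, `dynamicPart`, `fallbackPart`, `messagePart`) and the simplified `Part<V>` interface are hypothetical, and the "locale-appropriate" stringification is reduced to a plain String() call here.

```typescript
type FormattedMeta = Record<string, string | number | boolean | null>;

// Simplified stand-in for the four part interfaces.
interface Part<V> {
  type: "dynamic" | "fallback" | "literal" | "message";
  value: V;
  meta?: FormattedMeta;
  toString(): string;
  valueOf(): unknown;
}

function literalPart(value: string | number): Part<string | number> {
  return { type: "literal", value, toString: () => String(value), valueOf: () => value };
}

function dynamicPart<T>(value: T, meta?: FormattedMeta): Part<T> {
  return { type: "dynamic", value, meta, toString: () => String(value), valueOf: () => value };
}

// For fallback, toString() wraps the value as '{value}' and valueOf() returns the same.
function fallbackPart(value: string, meta?: FormattedMeta): Part<string> {
  const toString = () => `{${value}}`;
  return { type: "fallback", value, meta, toString, valueOf: toString };
}

// For message, toString() concatenates the stringified parts.
function messagePart(value: Part<unknown>[], meta?: FormattedMeta): Part<Part<unknown>[]> {
  const toString = () => value.map((part) => part.toString()).join("");
  return { type: "message", value, meta, toString, valueOf: toString };
}

const greeting = messagePart([
  literalPart("Hello, "),
  dynamicPart("Anne", { gender: "feminine" }),
  fallbackPart("$missing", { error_message: "Unresolved variable" }),
]);
const greetingText = greeting.toString();
```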

In the execution model of the EZ model, these formatted parts may also be used to wrap the arguments of a formatting function, which would allow e.g. for {TITLECASE(-common-term)} to be formatted as the transformed message -common-term, with the whole referenced message as well as each of its parts retaining all of their metadata through the title-casing of their values. A working implementation of this is available: https://github.com/messageformat/messageformat/blob/mf2/packages/messageformat/src/format-message.ts

Aug 15 '21 15:08 eemeli

I'd prefer subclassing dynamic values to be more concrete (see @formatjs/icu-messageformat-parser as a reference). This allows us to write a fairly comprehensive linter that analyzes the message and guards against TMS restrictions. The trickiest one is no-complex-selectors, which stems from Smartling's complexity index cap of 20 (if you flatten all your selectors, they should not yield more than 20 unique sentences).

Aug 16 '21 04:08 longlho

@longlho Your reference link and at least my understanding of your concerns would indicate that you might be talking about the representation of the source message, rather than this formatted output, where all the selectors, functions etc. have been resolved into a single sequence of formatted parts. Is this so, or have I misunderstood?

The MF2 data model representation of source messages is separate from this, and its allowance of selectors only at the top level should make it significantly easier to e.g. count selector cases directly.

Aug 16 '21 07:08 eemeli

Ah, I misunderstood this then. In that case, I believe dynamic is still not enough, and I would still like more structured data to encompass FormattedDateParts (for example), as @stasm mentioned. I think some primitive types still need to be part of the spec and can be expanded further.

Aug 16 '21 13:08 longlho

@stasm How would the "before variables are resolved to their values" bit work? I definitely agree with having parts including the value before its stringified, but what would be the benefit of not already determining the part's original value?

Sorry for missing this question back when. Looking at it today, I think I got this wrong. We should resolve the variable references to a runtime value (like the one you proposed in https://github.com/unicode-org/message-format-wg/issues/41#issuecomment-899067583) and stop there, i.e. yield those runtime values without formatting them to strings.

Aug 16 '21 16:08 stasm

A naming detail which I think may impact the understanding of the proposed interfaces:

Here's a concrete proposal for the interfaces of the formatted message parts, where the return type of a formatToParts() method or function would be FormattedPart[]:

Would FormattablePart be a better name for this? These objects store the raw value and expose the toString() method which means that the formatting still hasn't happened. FormattedPart implies the opposite, I think.

Aug 17 '21 09:08 stasm

Actually, let me take a step back. I was under the impression that we'd want to yield unformatted values, but after thinking about this this morning, I'm not so sure anymore.

Given the message: Transferred {NUMBER($fileSize, unit: "megabytes")}., would we want to:

1. yield unformatted values:

 {value: "Transferred ", toString(): ..., *toParts(): ...}
 {value: 1.23, toString(): ..., *toParts(): /* calls NumberFormat.formatToParts */}

2. yield formatted parts, nested:

 StringPart {value: "Transferred "}
 NumberPart [
     NumberFormatPart { type: 'integer', value: '1' },
     NumberFormatPart { type: 'decimal', value: '.' },
     NumberFormatPart { type: 'fraction', value: '23' },
     NumberFormatPart { type: 'literal', value: ' ' },
     NumberFormatPart { type: 'unit', value: 'MB' }
 ]

3. yield formatted parts, flattened:

 StringPart {value: "Transferred "}
 NumberFormatPart { type: 'integer', value: '1' },
 NumberFormatPart { type: 'decimal', value: '.' },
 NumberFormatPart { type: 'fraction', value: '23' },
 NumberFormatPart { type: 'literal', value: ' ' },
 NumberFormatPart { type: 'unit', value: 'MB' }

4. some combination of the above, e.g. (2) but also carrying the original raw value?

Aug 17 '21 10:08 stasm

@stasm Really good point. And I think it made me change my mind on a few things.

I actually had a decently long reply to this written, but then I realised that my approach to this is premised on

Needing/wanting to account for function composition, and the compound values that it effectively requires to be supportable as function arguments due to e.g. message metadata.
Enabling lazy reference resolution, which you had also talked about.
A desire to align the format-to-parts API with the formatting function argument API.

If instead we skip all of that and require eager resolution for formatting function args, we really ought to consider alignment with the existing prior art on this as a relatively high priority, i.e. follow what ECMA-402 does. And that to me answers your question: We should go with option 3, formatted & flattened parts, adding something like the source value that formatRangeToParts includes. So something like this:

[
  { type: 'literal', value: 'Transferred ' },
  { type: 'integer', value: '1', source: 1.23 },
  { type: 'decimal', value: '.', source: 1.23 },
  { type: 'fraction', value: '23', source: 1.23 },
  { type: 'literal', value: ' ', source: 1.23 },
  { type: 'unit', value: 'MB', source: 1.23 }
]

Not sure about the exact shape of the source there, mind. The point is, it should allow distinguishing the boundary between one source and the next. I do think that for not-explicitly-formatted non-primitive variable values we ought to have something like { type: 'variable', value: { foo: 'bar' } }.
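For what it's worth, a flat list like that can be sketched on top of Intl.NumberFormat today. `transferredParts` and the `SourcedPart` shape are hypothetical, and using the raw number as `source` just mirrors the example above; the exact part values depend on the locale data of the runtime.

```typescript
type SourcedPart = { type: string; value: string; source?: number };

// Hypothetical helper: format "Transferred {NUMBER($fileSize, unit: ...)}"
// to a flat list, tagging every number part with its source value.
function transferredParts(fileSize: number, locale: string): SourcedPart[] {
  const nf = new Intl.NumberFormat(locale, { style: "unit", unit: "megabyte" });
  const numberParts = nf
    .formatToParts(fileSize)
    .map((part): SourcedPart => ({ type: part.type, value: part.value, source: fileSize }));
  const literalPrefix: SourcedPart = { type: "literal", value: "Transferred " };
  return [literalPrefix].concat(numberParts);
}

const flatParts = transferredParts(1.23, "en");
// The boundary between sources is wherever `source` changes.
const fromNumber = flatParts.filter((part) => part.source === 1.23);
console.log(flatParts.map((part) => part.value).join(""));
```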

Aug 17 '21 11:08 eemeli

If instead we skip all of that and require eager resolution for formatting functions args, we really ought to consider alignment with the existing prior art on this as a relatively high priority

Can you explain how the eager vs. lazy resolution for function arguments ties into this? In my mind in both approaches, the parts yielded by formatToParts are a transformation on the function's output. In other words, it doesn't matter when $fileSize is resolved because the part (or parts) depend on the output of NUMBER($fileSize, unit: "megabytes").

Aug 17 '21 16:08 stasm

And that to me answers your question: We should go with option 3, formatted & flattened parts, adding something like the source value that formatRangeToParts includes.

Message formatting is unique enough that we could justify the nested approach too, kind of like (2) in https://github.com/unicode-org/message-format-wg/issues/41#issuecomment-900180331.

StringPart {value: "Transferred "},
NumberPart {
    value: 1.23,
    parts: [
        NumberFormatPart { type: 'integer', value: '1' },
        NumberFormatPart { type: 'decimal', value: '.' },
        NumberFormatPart { type: 'fraction', value: '23' },
        NumberFormatPart { type: 'literal', value: ' ' },
        NumberFormatPart { type: 'unit', value: 'MB' }
    ]
}

If instead we go for a completely flat output, then I like your idea to use { type: 'literal', value: 'Transferred ' } for the string part.

Aug 17 '21 17:08 stasm

Late night revelation that I wouldn't want to forget: flat output scales better when we're talking about messages referencing other messages, possibly more than one level deep.
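A quick sketch of what I mean, assuming a hypothetical nested representation in which a message reference contains its own parts. Flattening turns any depth of nesting into one list, with the nesting recorded as a path-like `source` string (the path format is an assumption for this example).

```typescript
type NestedPart =
  | { type: "literal"; value: string }
  | { type: "message"; id: string; value: NestedPart[] };

type FlatPart = { type: "literal"; value: string; source?: string };

// Recursively flatten message references, tracking where each part came from.
function flatten(parts: NestedPart[], source?: string): FlatPart[] {
  const flat: FlatPart[] = [];
  for (const part of parts) {
    if (part.type === "message") {
      const nestedSource = source === undefined ? part.id : `${source}/${part.id}`;
      for (const inner of flatten(part.value, nestedSource)) {
        flat.push(inner);
      }
    } else if (source === undefined) {
      flat.push({ type: "literal", value: part.value });
    } else {
      flat.push({ type: "literal", value: part.value, source });
    }
  }
  return flat;
}

// A message referencing -brand, which itself references -product.
const welcome: NestedPart[] = [
  { type: "literal", value: "Welcome to " },
  {
    type: "message",
    id: "-brand",
    value: [
      { type: "literal", value: "Acme " },
      { type: "message", id: "-product", value: [{ type: "literal", value: "Mail" }] },
    ],
  },
  { type: "literal", value: "!" },
];

const flatWelcome = flatten(welcome);
const welcomeText = flatWelcome.map((part) => part.value).join("");
```

A consumer iterates `flatWelcome` with a single loop regardless of how deep the references go.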

Aug 17 '21 22:08 stasm

Okay, updated proposal based on comments from @longlho and @stasm. I think the parts should be a flat list MessageFormatPart[] where

type MessageFormatPart = { source?: string } & (
  | { type: 'literal'; value: string }
  | { type: 'dynamic'; value: string | symbol | Function | object; source: string }
  | { type: 'fallback'; value: string; source: string }
  | { type: 'meta'; value: ''; meta: Record<string, string>; source: string | undefined }
  | Intl.DateTimeFormatPart
  | Intl.ListFormatPart
  | Intl.NumberFormatPart
  | Intl.RelativeTimeFormatPart
)

The added formatted part types are the same as before, except for meta replacing message:

literal strings and numbers are directly defined in the source message that's being formatted.
dynamic values are those returned by custom formatting functions, or from runtime variables. Their types are not defined or forced by the spec.
fallback is used when errors occur, e.g. a formatting function fails for some reason. Its value should be a syntax-like representation of the expression that caused the error.
meta always has an empty-string value, and contains metadata for its associated source value. If it has an undefined source, it applies to the entire message.

The fields are also much the same as before, though source is new:

type identifies what sort of part this is
value is the string value of the part, or for dynamic, an unknown stringifiable value
meta, if defined, holds metadata about the value. For example, it may contain an identifier for its gender, or an error_message for a fallback part.
source is a string identifier for the source of the value, when it's determined by a variable, function or term, with values like '$foo', 'NUMBER($num)', and '-some-term/$bar'. It may be used to identify the common origin of a sequence of parts.
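As a sketch of that last point, here is one way a consumer could regroup a flat list by source. `groupBySource` and the reduced `SimplePart` shape are hypothetical, and the sample parts are written out by hand rather than produced by a formatter.

```typescript
type SimplePart = { type: string; value: string; source?: string };

// Group consecutive parts that share the same defined source.
function groupBySource(parts: SimplePart[]): SimplePart[][] {
  const groups: SimplePart[][] = [];
  for (const part of parts) {
    const last = groups[groups.length - 1];
    const sameSource =
      part.source !== undefined && last?.[0]?.source === part.source;
    if (last !== undefined && sameSource) {
      last.push(part);
    } else {
      groups.push([part]);
    }
  }
  return groups;
}

const sampleParts: SimplePart[] = [
  { type: "literal", value: "Transferred " },
  { type: "integer", value: "1", source: "NUMBER($fileSize)" },
  { type: "decimal", value: ".", source: "NUMBER($fileSize)" },
  { type: "fraction", value: "23", source: "NUMBER($fileSize)" },
  { type: "literal", value: " ", source: "NUMBER($fileSize)" },
  { type: "unit", value: "MB", source: "NUMBER($fileSize)" },
  { type: "literal", value: "." },
];

const grouped = groupBySource(sampleParts);
```

With this, wrapping the whole number in a single element is a matter of rendering `grouped[1]` together.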

Aug 23 '21 06:08 eemeli

Thought it might be good to note here that my current thinking on formatting a message to parts is represented in the Intl.MessageFormat proposal here: https://github.com/tc39/proposal-intl-messageformat#messagevalue

In brief, I now think that the most appropriate part-like representation of a resolved message in JavaScript is a list of MessageValue objects which may each be toString() stringified, or, if available, split toParts() to produce a JS formatToParts representation.

I do not think that this representation necessarily makes sense in all environments, as it ends up relying on specific implementation choices and deeply interacting with its JS Intl surroundings. I also don't see benefits from having the parts representation being spec-mandated, rather than being determined by each implementation.

Nov 15 '22 10:11 eemeli

@aphillips Replying here to https://github.com/unicode-org/message-format-wg/pull/396#issuecomment-1618726401, as this seems like a more appropriate place for the conversation; see above for some prior related discussion.

Ultimately, though, my meta-point is: we should not defer "formatToParts" down the road much further. We should deal with it here to ensure that implementations can expose non-string resolution of parts, including nested sequences. Your original reaction was to my saying:

An "expression part" can be resolved to a sequence of zero or more "literal parts".

Notice that this allows the string resolution for an expression to be empty. And it requires that an "expression part" be ultimately resolvable to a literal. What it doesn't say (it probably should) is that an "expression part" doesn't have to directly resolve to a literal.

We can and should add the necessary support for non-string "expression parts". But your proposed text and the back of my napkin are both dealing with the string resolution bit. Would it help if the above said:

The string output of a message is the concatenated sequence of all parts once they have been resolved to a literal. Expression parts SHOULD NOT be resolved to a literal until required to do so by the caller (e.g. in a toString function or method) or because that is the preferred output by the expression's implementer (as in the datetime example in this section)

Ah, ok. So do I understand right that you're advocating for us to define a "format to string parts" API, and that if an implementation were to want to represent non-string-y values in expressions, then the implementation would need to provide a separate API for that?

Thus far, I have been working from the presumption that an "expression part" in a "format to parts" API would have at least the following qualities:

1. Identify the type of the resolved value
2. Enable access to the resolved value
3. Enable access to some string representation of the resolved value
4. Where appropriate, enable access to an Array<{ type: string, value: string }> representation of the resolved value

Would you agree with the above, or do you think that e.g. 2. should be left out?
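To make those four qualities concrete, a rough sketch of one possible shape. `ExpressionPart`, `numberExpression` and `opaqueExpression` are names invented here; nothing in the spec defines them.

```typescript
interface ExpressionPart<T> {
  type: string; // identifies the type of the resolved value
  value: T; // access to the resolved value itself
  toString(): string; // some string representation of the value
  toParts?(): Array<{ type: string; value: string }>; // only where appropriate
}

// A number supports all four qualities via Intl.NumberFormat.
function numberExpression(value: number, locale: string): ExpressionPart<number> {
  const nf = new Intl.NumberFormat(locale);
  return {
    type: "number",
    value,
    toString: () => nf.format(value),
    toParts: () => nf.formatToParts(value).map((p) => ({ type: p.type, value: p.value })),
  };
}

// An arbitrary value has no meaningful parts representation, so toParts is omitted.
function opaqueExpression<T>(type: string, value: T): ExpressionPart<T> {
  return { type, value, toString: () => String(value) };
}

const amount = numberExpression(1234.5, "en");
const image = opaqueExpression("image", { src: "image.jpg" });
const amountFromParts = (amount.toParts?.() ?? []).map((p) => p.value).join("");
```

Leaving out the second quality would mean dropping the `value` field, which is exactly what `image` needs in order to be useful to the caller.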

Would this representation in my fake JSON make sense:

{
  "locale": "ar-AE",
  "direction": "ltr",
  "parts": [
    {
      "type": "literal",
      "locale": "ar-AE",
      "direction": "ltr",
      "value": "Your image is "
    },
    {
      "type": "expression",
      "locale": "ar-AE",
      "direction": "rtl",
      "value": [
        { "type": "image", "locale": "ar-AE", "dir": "rtl", "name": "image", "src": "image.jpg" }
      ]
    },
    {
      "type": "literal",
      "locale": "ar-AE",
      "direction": "ltr",
      "value": " Isn't it pretty?"
    }
  ]
}

In principle, it seems to make sense if we only care about string output, but I'd leave out the locale & direction from all but the message and expression elements. Literals at least should inherit the message's properties.

Jul 04 '23 08:07 eemeli

@eemeli wrote:

Ah, ok. So do I understand right that you're advocating for us to define a "format to string parts" API, and that if an implementation were to want to represent non-string-y values in expressions, then the implementation would need to provide a separate API for that?

Actually, no. I don't think we are required to mandate any specific APIs.

What I'm trying to lay out is an approach to organizing the formatting spec.

The text portions of a pattern take care of themselves: they are always strings and their contents are otherwise opaque to us.

The expression portions of a pattern are a different matter and are somewhat complex. For example, if an expression contains only a variable ({$value}), that doesn't mean that the value can only be converted to a string. It could mean that the implementation applies a default formatter for a given type of value. Thus if $value is a Date or Temporal in Java, the implementation might apply the :datetime formatter.

If the value of a variable is a literal, it still might be formatted through a function and not just returned verbatim.

And we've discussed elsewhere that a function can return a sequence of "parts" for decoration.

Thus far, I have been working from the presumption that an "expression part" in a "format to parts" API would have at least the following qualities:

I think each "part" would have properties and the list of properties would be the same for each part--text, literal, or otherwise:

the locale
the direction
the "type" of value (meaning text, literal, or "list of parts", not classical types like int, float, etc.)
the resolved value (it does not say "string" here on purpose)

In principle, it seems to make sense if we only care about string output, but I'd leave out the locale & direction from all but the message and expression elements. Literals at least should inherit the message's properties.

I don't think the former is true: we care about specifying how a message is resolved. "to-string" is only one of the ways a message can be resolved (even if it is by far the most common).

I made a point about including the locale and direction because I want each part to have the same set of properties. While some programming languages/environments can differentiate using (for example) class or reflection, others don't make this easy. I don't think it is good to have to write code that has to differentiate text and expression parts:

for (const messagePart of mf.formatToParts(someArgs)) {
   someNode.lang = (messagePart.type === 'text') ? mf.getLocale() : messagePart.lang;
   someNode.dir = (messagePart.type === 'text') ? mf.getDirection() : messagePart.dir;
   // etc.
}

It's also the case that not all literal nodes will inherit direction or language (the text nodes would have to inherit it).
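One way to get the uniform shape without repeating data in the source representation is for the implementation to fill in message-level defaults before handing parts to the caller. This is only a sketch; `RawPart`, `UniformPart` and `withMessageDefaults` are hypothetical names.

```typescript
type Direction = "ltr" | "rtl" | "auto";

interface RawPart {
  type: "text" | "expression";
  value: unknown;
  locale?: string;
  dir?: Direction;
}

interface UniformPart {
  type: "text" | "expression";
  value: unknown;
  locale: string;
  dir: Direction;
}

// Every returned part carries locale and dir, inherited from the message
// when the part does not define its own.
function withMessageDefaults(
  parts: RawPart[],
  messageLocale: string,
  messageDir: Direction,
): UniformPart[] {
  return parts.map((part): UniformPart => ({
    type: part.type,
    value: part.value,
    locale: part.locale ?? messageLocale,
    dir: part.dir ?? messageDir,
  }));
}

const uniformParts = withMessageDefaults(
  [
    { type: "text", value: "Your image is " },
    { type: "expression", value: { src: "image.jpg" }, locale: "ar-AE", dir: "rtl" },
    { type: "text", value: " Isn't it pretty?" },
  ],
  "ar-AE",
  "ltr",
);

// Consumer code no longer needs to branch on the part type.
const nodes = uniformParts.map((part) => ({ lang: part.locale, dir: part.dir }));
```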

I should say more, but don't have the time today to work on it, but wanted to get some thoughts down quickly...

Jul 05 '23 16:07 aphillips

Very similar issue: "Decide on formatting to something other than text https://github.com/unicode-org/message-format-wg/issues/272"

Jul 10 '23 17:07 mihnita

Actually, no. I don't think we are required to mandate any specific APIs.

It sounds like all parties agree a firm API is not part of the specification, but there is good discussion on defining a structural definition of a formatted-part/MessageFormatPart type that, were such APIs implemented, they should satisfy? Please confirm/assert. In other words, new org.unicode.icu.MF2(...).formatToParts(...) is not a required method, but, we are defining a spec-supported structural type such that, if an API was implemented, it should adhere to the type?

@aphillips you mention you're "trying to lay out ... an approach to organizing the formatting spec". Forgive me, I'm not tracking how the rest of your message correlates to that end. I observe commentary on the parts schema design. Can you help set me straight, in simple terms?

Jul 19 '23 20:07 cdaringe

The main trouble with deciding on a certain structure in the spec is that it will cause a lot of friction with existing implementations.

Yes, there is no MF2-like in MF1.

For example, ICU formats to "something that implements the FormattedValue interface". https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/number/LocalizedNumberFormatter.html#format-java.lang.Number- (the FormattedNumber above implements FormattedValue, which is some kind of "format to parts")

Having the same kind of result from MF2 (a FormattedValue) means that I can look at the final result and "deep dive" into the parts of the elements. For example, "We are open between {interval}" can be formatted to something that allows me to find the start and end part of the interval, and even the hour part of the start of the interval. If MF2 returns something other than what the formatters return, there is going to be a lot of duct taping to convert to a new structure.

And it is not an ICU problem. I would expect that JavaScript (and a MF2 implementation in the browser) would feel (rightly so) that a DOM is the best "formatter to part" structure.

Android has Spannable, macOS has AttributedString. So I assume native implementations for those platforms would prefer not to change those types to something else.

These are all structures that are "format-to-parts" like, but very hard (impossible?) to unify.

We can try to say what one might expect to find in such a structured result, but not the fields, or methods.

Jul 19 '23 22:07 mihnita

It sounds like all parties agree a firm API is not part of the specification, but there is good discussion on defining a structural definition of a formatted-part/MessageFormatPart type that, were such APIs implemented, they should satisfy? Please confirm/assert. In other words, new org.unicode.icu.MF2(...).formatToParts(...) is not a required method, but, we are defining a spec-supported structural type such that, if an API was implemented, it should adhere to the type?

I think we'd like to be a bit more conservative and agnostic, still. Rather than defining specific structural types, we can provide guidance to implementers about how to design formatted parts, and list a number of requirements that they should meet.

For example, based on https://github.com/unicode-org/message-format-wg/issues/160#issuecomment-1646558377:

The formatted parts should allow to identify their origin.
The formatted parts should be decorated with grammatical data.
The selected variant should be decorated with its keys.
Etc.
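Purely as an illustration of how an implementation might meet these requirements (this is not a proposed structural type), a sketch with hypothetical field names `source`, `grammar` and `selectedKeys`, and hand-written sample data.

```typescript
interface GuidedPart {
  value: string;
  source?: string; // lets the caller identify the part's origin
  grammar?: Record<string, string>; // grammatical data, e.g. number or gender
}

interface FormattedResult {
  selectedKeys: string[]; // keys of the variant that was selected
  parts: GuidedPart[];
}

// Find all parts that originate from a given source.
function partsFrom(result: FormattedResult, source: string): GuidedPart[] {
  return result.parts.filter((part) => part.source === source);
}

// Hand-written result for a message with a plural selector on $count.
const inbox: FormattedResult = {
  selectedKeys: ["one"],
  parts: [
    { value: "You have " },
    { value: "1", source: "$count" },
    { value: " new " },
    { value: "message", source: "-message-term", grammar: { number: "singular" } },
  ],
};

const countParts = partsFrom(inbox, "$count");
```

Another implementation could satisfy the same three requirements with DOM attributes or spans instead of plain objects.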

Jul 22 '23 11:07 stasm

Adding to Stas' bulleted list:

some placeholders might produce a "collection of parts", not just one part
we should not consider "one iterator", but possibly several alternative iterators, or annotated text (no iterator concept)

Example:

{We are closed between {closedDates, :daterange, year=numeric month=full day=numeric}}

There is one placeholder (closedDates), but in the result I should be able to access the range, the start / end of the range, the month of the start of the range, etc. So these concepts are overlapping, no good way to represent as an iterator.
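A sketch of the "annotated text" alternative, assuming a hypothetical model where the result is one string plus a set of possibly overlapping spans. The field names (`range`, `range.start`, and so on) and the hard-coded text are made up for this example and do not come from any formatter.

```typescript
interface Annotation {
  field: string;
  start: number; // inclusive offset into the text
  end: number; // exclusive offset into the text
}

interface AnnotatedText {
  text: string;
  annotations: Annotation[];
}

// Helper for the example: annotate the first occurrence of a substring.
function annotate(text: string, field: string, substring: string): Annotation {
  const start = text.indexOf(substring);
  return { field, start, end: start + substring.length };
}

// Look up the text covered by a field, if that field is present.
function textOf(annotated: AnnotatedText, field: string): string | undefined {
  const found = annotated.annotations.find((a) => a.field === field);
  return found === undefined ? undefined : annotated.text.slice(found.start, found.end);
}

const closedText = "We are closed between December 23, 2023 - January 2, 2024";
const closed: AnnotatedText = {
  text: closedText,
  annotations: [
    annotate(closedText, "range", "December 23, 2023 - January 2, 2024"),
    annotate(closedText, "range.start", "December 23, 2023"),
    annotate(closedText, "range.start.month", "December"),
    annotate(closedText, "range.end", "January 2, 2024"),
  ],
};

const startMonth = textOf(closed, "range.start.month");
const rangeEnd = textOf(closed, "range.end");
```

The spans for the range, its start, and the start's month all overlap, which a single flat iterator of non-overlapping parts cannot express directly.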

Might also have "annotations" that result in the final text being rendered differently depending on context. For example one might render as "12/23/2023" when rendered as text, or "December 23, 2023" when rendered as voice (by a TTS engine).

Jul 31 '23 21:07 mihnita

This may have been addressed in the F2F proposal for F2P (format to parts).

As mentioned in today's telecon (2023-09-18), closing old requirements issues.

Sep 18 '23 19:09 aphillips