
"Local" text transformations (contextual changes)

Open mihnita opened this issue 3 years ago • 3 comments

After placeholders are replaced with the final text, the result should in some cases be "adjusted to match".

Common English example: "I received a {item}". The "a" there should become "an" for words that start with a vowel.

Might even want to be fancier, imagine handling "I received a <a href="link">{item}</a>", where the "a" should account for a vowel pretty far away, inside an html tag.

There are similar needs in other languages, for example in French la and le become l’ before a word that starts with a vowel. See https://en.wikipedia.org/wiki/Elision_(French)

"Contractions" similar to the French one are also common in Italian, Spanish, Portuguese. And this kind of "local transformation" might also help with Belarusian, Korean, others. (for Spanish see https://www.thoughtco.com/pronunciation-based-changes-y-and-o-3078176)

With a more powerful system these would be handled by specifying that "I want the indefinite article of the noun", and that would solve "a grape" vs "an apple". But such systems are not always available. This can be a "poor man's" alternative.

In order to be able to do this "the smart way" one would need to know AFTER FORMAT that a certain part of the message comes from a placeholder. That way we don't "repair" text that was put there (intentionally) by the translator, but only text that was "created" by placeholders. This is why I named this "local" text transformation / contextual changes, as opposed to issue #38, which seems to apply indiscriminately on the whole (final) message.

The exact transformations are probably not part of the standard itself (but might be "registered"?)

I think that this can be implemented on top of the already proposed format-to-parts feature (issue #41). So I don't think that there is anything special to do with the data model if we already support that feature.

But better to list this explicitly.
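A minimal sketch of such a "local" repair on top of format-to-parts, assuming a hypothetical part shape that carries only `type` and `value`. The names `OriginPart`, `fixIndefiniteArticle`, and `renderParts` are illustrative and not part of any MF2 proposal. The check looks at spelling rather than pronunciation, so it is only an illustration of the origin-aware idea:

```typescript
// Hypothetical part shape: the only assumption is that a part reveals its origin.
interface OriginPart {
  type: "literal" | "placeholder";
  value: string;
}

// Crude spelling check. Real English depends on sound ("an hour", "a unicorn").
function startsWithVowelLetter(text: string): boolean {
  return /^[aeiou]/i.test(text);
}

// Turn a trailing "a" into "an" only when the next part came from a placeholder.
// Literal text followed by literal text is left exactly as the translator wrote it.
function fixIndefiniteArticle(parts: OriginPart[]): OriginPart[] {
  return parts.map((part, index) => {
    const next: OriginPart | undefined = parts[index + 1];
    if (
      part.type === "literal" &&
      next !== undefined &&
      next.type === "placeholder" &&
      startsWithVowelLetter(next.value)
    ) {
      return { ...part, value: part.value.replace(/(^|\s)a(\s+)$/, "$1an$2") };
    }
    return part;
  });
}

function renderParts(parts: OriginPart[]): string {
  return parts.map((part) => part.value).join("");
}

const appleMessage = renderParts(
  fixIndefiniteArticle([
    { type: "literal", value: "I received a " },
    { type: "placeholder", value: "apple" },
  ])
);
// appleMessage is "I received an apple"
```

Because the rewrite is gated on the origin of the following part, text that a translator typed deliberately is never touched.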

mihnita avatar Mar 23 '21 18:03 mihnita

https://unicode-org.atlassian.net/browse/CLDR-14621

mihnita avatar Mar 24 '21 04:03 mihnita

I think the right thing to do in MessageFormat 2.0 would be to ensure that it's possible to implement transformations on formatted parts, while leaving out the actual implementation of such transformations.

One category of such transformations would be ones that are triggered explicitly at the boundary between two parts, e.g. for a {thing} to get formatted as "an example". In English this is relatively simple as the a/an difference is only based on morphology, but for others it's more complicated. For instance, take le {chose} in French; in many cases it's practically speaking required to be able to give the chose value not just as a string, but also with its gender attached, so that the output could be "la covid" but "le corona virus".

In order to make that work, the formatted parts would need to look something like this:

[
  { type: 'literal', value: 'le ' },
  { type: 'dynamic', value: 'covid', meta: { count: 'one', gender: 'feminine' } }
]

Given that input and the French locale, it'd be relatively straightforward to detect the le as a generic definite article, and to replace it with la on account of the one count and feminine gender. For l' some heuristic or other would still be needed to determine if covid starts with a vowel sound in French.
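As a rough sketch of that detection step, assuming the part shape from the example above (the `GenderedPart` type and `agreeFrenchArticle` function are hypothetical names, and the `meta` fields are taken from the example rather than from any agreed design). Elision to l' is left out on purpose, since it needs pronunciation data:

```typescript
// Part shape modelled on the example above; the meta fields are assumptions.
interface GenderedMeta {
  count?: string;
  gender?: string;
}

interface GenderedPart {
  type: "literal" | "dynamic";
  value: string;
  meta?: GenderedMeta;
}

// Replace a generic trailing "le " with "la " when the following dynamic part
// is singular and feminine. Everything else is passed through unchanged.
function agreeFrenchArticle(parts: GenderedPart[]): GenderedPart[] {
  return parts.map((part, index) => {
    const next: GenderedPart | undefined = parts[index + 1];
    const feminineSingular =
      next !== undefined &&
      next.type === "dynamic" &&
      next.meta?.gender === "feminine" &&
      next.meta?.count === "one";
    if (part.type === "literal" && feminineSingular && /(^|\s)le $/.test(part.value)) {
      return { ...part, value: part.value.replace(/le $/, "la ") };
    }
    return part;
  });
}

const covidParts = agreeFrenchArticle([
  { type: "literal", value: "le " },
  { type: "dynamic", value: "covid", meta: { count: "one", gender: "feminine" } },
]);
// covidParts renders as "la covid"
```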

Now, working backwards from the formatted-parts output, how do we get the metadata attached to the part in the first place? One option would be a preceding dictionary-based transformation, which would indeed know that it's "la covid" and assign the metadata according to its own logic.

But should we do more? In particular, what do we do when "covid" is coming from a message reference? An obvious precedent to look at is Fluent and its message attributes, which cover exactly this use case. In the MF2 data model, such attributes would logically end up in the meta value of the relevant message, from which they could be included almost directly in the formatted part.

In other words, while we have support for all this in the data model, it underlines a need to account for this in the syntax as well.

eemeli avatar May 02 '21 14:05 eemeli

Another thought: This connects with the previously-mentioned data registry for common selector types. Specifically, I'm reminded of this example https://github.com/unicode-org/message-format-wg/issues/3#issuecomment-576389845 of how Mozilla uses Fluent selectors:

-brand-name = { $case ->
    [nominative] Firefox
    [genitive] Firefoksa
    ...
}

If we know that a case is expected to take values such as "nominative", "genitive", and so on, we could use a heuristic to set a value for the case metadata of the formatted part based on a usage pattern as above.
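One way such a heuristic might look, as a sketch only: the `KNOWN_CASES` list stands in for whatever a registry would declare, and `inferCaseMeta` and `CaseMeta` are hypothetical names. The catch-all key is written as "*" here purely for illustration:

```typescript
// Hypothetical list of values a registry might declare for a "case" selector.
const KNOWN_CASES: ReadonlySet<string> = new Set([
  "nominative",
  "genitive",
  "dative",
  "accusative",
]);

interface CaseMeta {
  grammaticalCase?: string;
}

// If every named variant key is a known case name, assume the selector is a
// case selector and record the selected key as metadata for the formatted part.
function inferCaseMeta(variantKeys: string[], selectedKey: string): CaseMeta {
  const named = variantKeys.filter((key) => key !== "*");
  const looksLikeCase =
    named.length > 0 && named.every((key) => KNOWN_CASES.has(key));
  return looksLikeCase && KNOWN_CASES.has(selectedKey)
    ? { grammaticalCase: selectedKey }
    : {};
}

const brandMeta = inferCaseMeta(["nominative", "genitive"], "genitive");
// brandMeta.grammaticalCase is "genitive"
```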

eemeli avatar May 02 '21 15:05 eemeli

I suspect that this is either (a) the province of a function or selector or (b) out of scope.

That is, one might implement what @eemeli mentions above to create a :case selector (by associating gender, count, and article agreement with the data) or using some other means of selecting between patterns. If such a function is not part of the default registry, then that would make it out of scope.

@mihnita Any thoughts on where to go next here?

aphillips avatar Jul 16 '23 19:07 aphillips

I think that various language features will be supported by selectors / formatters / post-processing. Post processing might be done either on the "formatted to parts" result, or maybe earlier (internally in MF2, with the format-to-parts already having the transformations applied).

My expectation (hope?) is that we will see more and more ML at work, and the role of MF2 will be to provide hooks and hints.

Even apparently straight-forward rules are in fact pretty tricky. For example in French one would think that le/la followed by a word starting with a consonant becomes l’ (l apostrophe).

s / followed by a word starting with a consonant / followed by a word starting with a vowel / (thanks Asmus)

But what about "la <img> sauvage" ("the wild <img>")? If we look at the text, "sauvage" starts with a consonant, and we should keep "la" as is. But what if the <img> (or an emoji) is intended as "text"? Imagine it's a bee ("abeille"). The text "la 🐝 sauvage" is really "l'abeille sauvage" (the wild bee).

mihnita avatar Jul 20 '23 22:07 mihnita

For example in French one would think that le/la followed by a word starting with a consonant becomes l’ (l apostrophe). But that about "la sauvage" ("the wild ")?

Don't you mean "vowel" here?

I agree with the wider point about inline, pronounceable, images.

However, this would assume that you provide the alternative text so it directly substitutes for a noun or noun phrase and in a way that universally matches what a native user would choose...

asmusf avatar Jul 20 '23 23:07 asmusf

I would think that specific language rules are outside our scope. My point regarding a selection on case is that with some selectors, a post-processor applying such heuristics or logic could find the selected key or keys to have significant value.

As a toy example, let's say that we have a language model or other system that's capable of "fixing" or "improving" slightly broken messages, but that running it is expensive. Now, let's suppose that we have a message like

match {$foo :person-gender}
when female {She did a thing}
when * {They did a thing}

and we call this with a $foo for which :person-gender would prefer male, but that's not available for this message in this locale, so we end up with the fallback un-gendered message.

Now, if we could include that selection preference and result in our output somehow, some pretty simple logic could see that we've ended up with a fallback message, and use the expensive system to ask for the "male" version of "They did a thing".

Alternatively, with the same message but without the semi-magical LLM, we could be formatting a whole set of messages together. Then, we could note that for some of them, the preferred gender message is not available, and it's using a fallback. If we wanted all the messages to correspond with each other, we could then re-format the gendered ones to be neutral, rather than presenting the user with a mix.
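Both scenarios rely on the output recording what each selector preferred and what it actually matched. A minimal sketch, assuming a hypothetical `SelectionOutcome` record attached to each formatted message (none of these names or fields come from an existing MF2 API):

```typescript
// Hypothetical record of one selector's outcome.
interface SelectionOutcome {
  selector: string; // e.g. ":person-gender"
  preferred: string; // key the selector would have liked, e.g. "male"
  matched: string; // key of the variant actually chosen, e.g. "*"
}

interface SelectedMessage {
  text: string;
  selections: SelectionOutcome[];
}

// A message fell back if any selector matched the catch-all key
// while preferring something more specific.
function usedFallback(message: SelectedMessage): boolean {
  return message.selections.some(
    (outcome) => outcome.matched === "*" && outcome.preferred !== "*"
  );
}

// For a batch, report whether at least one message fell back, in which case
// a caller might choose to re-format the gendered ones neutrally.
function batchNeedsNeutral(batch: SelectedMessage[]): boolean {
  return batch.some(usedFallback);
}

const theyDid: SelectedMessage = {
  text: "They did a thing",
  selections: [{ selector: ":person-gender", preferred: "male", matched: "*" }],
};
// usedFallback(theyDid) is true
```

The same flag could gate the call to an expensive repair system, so that it only runs for messages that actually fell back.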

eemeli avatar Jul 21 '23 06:07 eemeli

There are a few requests bundled here. Let me try to untangle them:

(I'm using "should" to define the requirements, not voice my opinion.)

The formatted parts should make it possible to identify their origin.

I.e. whether they were literal text or a product of a placeholder.

We don't have a concrete design of the interface of formatted parts (#41), and I think we also prefer to stay agnostic wrt. the exact data shapes, but I think it's safe to assume that all formatToParts implementations will want to do this.

Recommendation: add this to the requirements for formatToParts implementations.

The formatted parts should be decorated with grammatical data.

Or more broadly: data relevant to the formatting of the placeholder, to allow further context-aware transformations.

I don't remember if we explicitly discussed this. My position is that it would be great to allow formatToParts implementations to extend parts with such data. Exact data shapes are most likely implementation-specific. An example of what this could look like, based on my implementation: message2//grammatical_agreement.ts.

Recommendation: add this to the requirements for formatToParts implementations.

It should be possible to run text transformers on formatted messages.

The logic of text transformers is outside the scope of MF2, but we'd like to make sure it's possible to run text transformers on formatted parts.

This is satisfied to some extent just by formatToParts existing. Transformation layers can be built on top of formatToParts, just like any other extra logic.

We can also consider ways of plugging transformers into the formatting runtime, similar to how we provide extension points for custom formatting and matching functions.
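A layer of this kind could be as small as a list of functions applied in order to the formatted parts. The sketch below assumes a hypothetical `PartTransformer` signature and a bare `PipelinePart` shape; it is not a proposed extension point, only an illustration of the layering:

```typescript
// Hypothetical minimal part shape for the pipeline.
interface PipelinePart {
  type: string;
  value: string;
}

// A transformer receives the parts and the locale and returns new parts.
type PartTransformer = (parts: PipelinePart[], locale: string) => PipelinePart[];

// Apply transformers left to right, feeding each one the previous result.
function applyTransformers(
  parts: PipelinePart[],
  locale: string,
  transformers: PartTransformer[]
): PipelinePart[] {
  return transformers.reduce(
    (current, transform) => transform(current, locale),
    parts
  );
}

// Toy transformer: use typographic apostrophes in literal parts only.
const curlyApostrophes: PartTransformer = (parts) =>
  parts.map((part) =>
    part.type === "literal"
      ? { ...part, value: part.value.replace(/'/g, "’") }
      : part
  );
```

A runtime-level hook would presumably call something like `applyTransformers` internally, before handing the parts back to the caller.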

I think this could be the same mechanism for both "local" transformations and pattern-level ones, which is #38.

Recommendation: Discuss a new kind of extension point for pattern-level transformers, possibly defined in the registry.

The selected variant should be decorated with its keys.

This is similar to the second requirement above, but for the whole selected variant. For example, formatToParts could not only return an iterator over formatted parts, but also other data:

interface MessageFormat {
    formatToParts(): FormattedVariant;
}

interface FormattedVariant {
    parts: Iterator<FormattedPart>;
    keys: Array<string>;
}

While a lot of the same information can probably be extracted from the decorated formatted parts, I think decorating the variant itself can be particularly useful for variants with no placeholders. E.g. when 1 {One thing}.
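To illustrate the no-placeholder case, here is a sketch of how a caller might consume such a result. The `FormattedPart` shape is an assumption (the interface above does not define it), and `collectVariant` is a hypothetical helper:

```typescript
// Assumed shape; the thread leaves FormattedPart undefined.
interface FormattedPart {
  type: string;
  value: string;
}

interface FormattedVariant {
  parts: Iterator<FormattedPart>;
  keys: Array<string>;
}

// Drain the iterator into a string and keep the variant keys alongside it.
function collectVariant(variant: FormattedVariant): { text: string; keys: string[] } {
  let text = "";
  let step = variant.parts.next();
  while (!step.done) {
    text += step.value.value;
    step = variant.parts.next();
  }
  return { text, keys: variant.keys };
}

// The variant "when 1 {One thing}" has no placeholders, so only the keys
// reveal that it was selected for the value 1.
const oneThingParts: FormattedPart[] = [{ type: "literal", value: "One thing" }];
const oneThing: FormattedVariant = {
  parts: oneThingParts[Symbol.iterator](),
  keys: ["1"],
};
```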

Recommendation: add this to the requirements for formatToParts implementations.

stasm avatar Jul 22 '23 11:07 stasm

I think it helps focus our thinking if we have several examples of any principle. For example:

  1. We need to know boundaries and origin information, recursively. These are for placeholders, but also placeholders within placeholders.

Examples:

  • Embolden the month in a message that contains a date: You last visited on March 3, 2022.

  • Line up messages visually in a column, on the right side of the integer part of numbers (whether integers or fractions)

666.3 credit
  4 debit

...

  • Perform orthographic fixes on insertion boundaries: You bought a apple. (Internally You bought a {apple}) ⇒ You bought an apple.

Note that for placeholders we are dependent on the formatting functions to supply the information.
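The first example (emboldening the month) can be sketched with a recursive part tree. The `TreePart` shape and the part type names such as "date" and "month" are assumptions for illustration; in practice they would come from whatever the formatting function reports:

```typescript
// Hypothetical recursive part: either a leaf with a value or a node with children.
interface TreePart {
  type: string;
  value?: string;
  parts?: TreePart[];
}

// Render the tree to a string, wrapping every part of the given type in <b>.
function renderWithBold(part: TreePart, boldType: string): string {
  const inner = part.parts
    ? part.parts.map((child) => renderWithBold(child, boldType)).join("")
    : (part.value ?? "");
  return part.type === boldType ? `<b>${inner}</b>` : inner;
}

const visitMessage: TreePart = {
  type: "message",
  parts: [
    { type: "literal", value: "You last visited on " },
    {
      type: "date",
      parts: [
        { type: "month", value: "March" },
        { type: "literal", value: " " },
        { type: "day", value: "3" },
        { type: "literal", value: ", " },
        { type: "year", value: "2022" },
      ],
    },
    { type: "literal", value: "." },
  ],
};
// renderWithBold(visitMessage, "month") is
// "You last visited on <b>March</b> 3, 2022."
```

The same tree walk works for placeholders within placeholders, since each level only needs to know its own children.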

Visually, something like: [image: Screenshot 2023-07-22 at 10 33 33]

macchiati avatar Jul 22 '23 17:07 macchiati

Thanks @macchiati for the lovely illustration (which mirrors my thinking regarding format-to-parts behavior).

I'm removing resolve-candidate for now, but observe that this issue is somewhat ill-defined in what it is asking for. @mihnita can you clarify or consider breaking up the issue into specific requests for syntax, formatting, data-model, or registry changes?

aphillips avatar Jul 29 '23 16:07 aphillips

can you clarify or consider breaking up the issue into specific requests for syntax, formatting, data-model, or registry changes?

I think it is a bit early to split into syntax / formatting / etc. That is already close to a solution. And I don't think we have an agreement (yet)

I think the items Stas added are good.

What I am thinking might help is to add another flag to the placeholders (propagated all the way into the "format to parts" output) saying "this is not rendered as visible / audible text" or "this is decoration only" or something like that.

Meaning that post-processing steps can ignore that whole "set of parts" because it is not visible.

For example in "a {+b}{+i}{fruit}{-i}{-b}" the bold and italic are not "rendered as text", so they don't affect the linguistic processing of "a" (to become "an" or not). From text processing perspective they behave as if they are not there. As opposed to an image.
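A sketch of how a post-processor might use such a flag, assuming a hypothetical boolean `textual` on each part (false for pure decoration such as bold or italic markup). The names `FlaggedPart` and `fixArticleAcrossMarkup` are illustrative, and the vowel check is by spelling only:

```typescript
// Hypothetical part shape with the proposed flag.
interface FlaggedPart {
  type: "literal" | "placeholder" | "markup";
  value: string;
  textual: boolean; // false means "decoration only, not rendered as text"
}

// Find the first part after `from` that is rendered as visible or audible text.
function nextTextual(parts: FlaggedPart[], from: number): FlaggedPart | undefined {
  return parts.slice(from).find((part) => part.textual);
}

// Apply the a/an fix while looking through decoration-only parts.
function fixArticleAcrossMarkup(parts: FlaggedPart[]): FlaggedPart[] {
  return parts.map((part, index) => {
    const next = nextTextual(parts, index + 1);
    if (
      part.type === "literal" &&
      next !== undefined &&
      next.type === "placeholder" &&
      /^[aeiou]/i.test(next.value)
    ) {
      return { ...part, value: part.value.replace(/(^|\s)a(\s+)$/, "$1an$2") };
    }
    return part;
  });
}

const boldFruit: FlaggedPart[] = [
  { type: "literal", value: "a ", textual: true },
  { type: "markup", value: "<b>", textual: false },
  { type: "markup", value: "<i>", textual: false },
  { type: "placeholder", value: "apple", textual: true },
  { type: "markup", value: "</i>", textual: false },
  { type: "markup", value: "</b>", textual: false },
];
// fixArticleAcrossMarkup(boldFruit) renders as "an <b><i>apple</i></b>"
```

An image part would keep `textual: true`, so it would stop the lookahead instead of being skipped.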

IF there is agreement on "yes, such a flag would be good", then we get to debate if it belongs in the placeholder or registry, syntax, etc. I don't think it belongs in registry, because we don't want / need the "grammatical processing engine" to have to know about the registry, at runtime.

mihnita avatar Jul 31 '23 21:07 mihnita

I think the topic of this issue is pretty clear in the original post (formatted parts should identify which placeholder, if any, they come from), but the discussion since then has wandered quite a bit.

It would be useful to record any/all such wanderings as separate issues, so that they don't get lost if/when this issue is closed by addressing the original suggestion.

eemeli avatar Aug 01 '23 09:08 eemeli

I can't tell what this issue is about (what is being specifically proposed). I agree with @eemeli's observation about formatted parts identifying their placeholder. Does that mean we've addressed this? If not, can we make a specific issue or issues for what is needed? Thanks.

aphillips avatar Dec 04 '23 15:12 aphillips