message-format-wg

Requirements - MF wishlist

Open romulocintra opened this issue 5 years ago • 65 comments

List of requirements to consider for MF

romulocintra avatar Nov 27 '19 18:11 romulocintra

I'm listing the requirements from the first meeting's slides:

List of possible requirements

  • Easier to use ICU “Select”
  • Fluent could be considered as a starting point for the future of message format
  • Have pluggable “formatters”(Date/Time/Number ...)
  • HTML Markup
  • Cross-platform / Universal Format
  • Messages should have more context “description” or ”metadata”
  • MessageFormat - More Readable
  • Escaping(“ or ‘ ) and Interpolations (html tags)
  • Rule Modifiers - Send Message or Send SMS  -> similar to select ICU feature
  • Improve Translators / Developers  UX/DX
  • I need to somehow be able to cache my translations
  • Use Yaml or JSON as file format
  • Message reference - from another Message

romulocintra avatar Jan 06 '20 19:01 romulocintra

Proposal for an additional requirement:

  • Provides a translation of an XML/HTML element.

zbraniecki avatar Jan 06 '20 19:01 zbraniecki

Sorry, I wasn't at the first meetings, so I'm not sure what is meant by "HTML Markup".

But:

  • fully agree on custom pluggable "formatters"

And add:

  • extended plurals, like:
{ count , plural ,
   =0 {No candy left}
  one {Got # candy left}
  <10 {Got a few candies left}
  10-20 {Got a handful candies left}
other {Got # candies left} }

edit: in i18next we use a postProcessing plugin to achieve that: https://github.com/i18next/i18next-intervalPlural-postProcessor#usage-sample

jamuhl avatar Jan 06 '20 19:01 jamuhl

HTML Markup

Ability to interpolate localization with HTML. Example:

<span>You have <b>6</b> unread messages from <img/> Mary.</span>

Fluent provides DOM Overlays which are heavily used in Firefox l10n - https://github.com/projectfluent/fluent.js/wiki/DOM-Overlays

zbraniecki avatar Jan 06 '20 19:01 zbraniecki

@zbraniecki thank you for explaining...so basically take the innerhtml element(s) and extend it with the attributes and content contained in the translation...looks similar to the Trans component we have in react-i18next -> https://react.i18next.com/latest/trans-component (just we have no html elements but react components)

edit: guess we could mimic DOM-Overlays by extending our Trans component...just not sure if this is part of the syntax or an extension that is provided by the i18n library?

jamuhl avatar Jan 06 '20 19:01 jamuhl

@mihnita should I reference your entire document here, or can we break it into features to add here?

romulocintra avatar Jan 06 '20 19:01 romulocintra

In our experience innerHTML in particular is a no-go for security reasons (l10n resources are treated as a third-party). I expect the requirements from the W3C to be similar here.

Instead, we whitelist allowed textual elements (<sup/>, <sub/>, <span/> etc.) and for everything else we require the developer to provide the elements in the source with a name, and then the localizer can position them using the same name:

<p data-l10n-id="key1">
  <a href="https://www.mozilla.org" data-l10n-name="link"/>
  <img src="./pics/img1.png" data-l10n-name="logo"/>
</p>
key1 =
    Welcome to <a data-l10n-name="link">Mozilla</a>!
    Please, click on <img data-l10n-name="logo"/> to proceed.

That's significantly more involved than innerHTML, but the end result is quite similar with a lot of linting, security, and sanity checks. We're also discussing further extensions - https://github.com/zbraniecki/fluent-domoverlays-js/wiki/New-Features-(rev-3)

zbraniecki avatar Jan 06 '20 19:01 zbraniecki

innerHTML was more referring to the content than to the implementation detail...same reason we do not just append translations into a react element by using dangerouslySetInnerHTML ;)

jamuhl avatar Jan 06 '20 19:01 jamuhl

I will break it into features. But maybe also link it, so that others can read the complete doc.

I think that the current list of features will also need to "grow" with some more details. As it is, some of them are so short that only the person who proposed them really understands what they mean :-)

Mihai

mihnita avatar Jan 07 '20 15:01 mihnita

@mihnita

  • If you can break it into features, great, and the link is important too. Both are important.
  • I completely agree that some of the features won't fit in one line and will need more detail; those, IMHO, deserve their own issue or thread.

My Proposal :

  • If you can break it into features, that will be perfect (agreed that the link is important too).
  • Some of the features won't fit in a one-line description and need more detail; those, IMHO, deserve their own issue or thread. I suggest that we create a new issue tagged "requirements" for each, holding all the detail and discussion about it, and keep a reference with a short description here so the list stays in one place.

I feel that the ones with short descriptions will also grow to have their own issue/task, but I think we can figure that out later, after we groom and filter the lists of requirements.

romulocintra avatar Jan 07 '20 17:01 romulocintra

My proposal for the process @romulocintra is to set a deadline, then de-dupe the list, then prioritize into mvp, v1, v2... so we can move this along.

longlho avatar Jan 07 '20 17:01 longlho

My proposal for the process @romulocintra is to set a deadline, then de-dupe the list, then prioritize into mvp, v1, v2... so we can move this along.

@longlho I believe this (process, MVP, roadmap, goals) must be addressed in #4, where we can define all the related organizational and process matters as a team.

Regarding this task and how we organize the list, I think the previous proposal fits our current needs. I did not propose any deadline for this task, but I see the next meeting as a good candidate to prioritize/filter/de-dupe the items originating in this thread. Finally, we can review #4 to close all the organizational issues, deadlines and goals.

Meanwhile, I'm referencing your comments in #4

PS: just added these topics to the next meeting agenda

romulocintra avatar Jan 07 '20 18:01 romulocintra

Right now, in ICU4J, if you do: "You owe {someNumber, number, currency}." - then the actual currency is inferred from the current locale - which is just nasty.

You can do this: "You owe {someNumber, number, :: currency/JPY}." - but this means that you know in advance that you're dealing with a specific currency - JPY - in this case. One should be able to declare the actual currency at run time. Perhaps Fluent already supports this?

MickMonaghan avatar Jan 12 '20 21:01 MickMonaghan

Sorry for joining the conversation late and having to leave the last session early but here is my take:

  • Make the syntax cross-language/cross-platform. Maybe having an RFC and/or improved (non-technical) documentation of the syntax would help?
  • See if we can make the syntax easier to read (not just for developers, but presuming "raw" syntax could also be translatable by linguists)
  • Provide free tools with the syntax for authoring and translation (our own online CAT tool?)
  • Extend selectors (I like @jamuhl's example and will have others to present in the next session)
  • File format-agnostic - not every TMS does a good job supporting file formats. If the syntax is independent, it is more flexible to adopt
  • Leave the syntax markup-agnostic (e.g. HTML) - the syntax should be able to accept HTML or any other markup, but the TMS and/or library can implement manipulation however it finds best for its use case
  • Find better ways to escape the syntax (' is way too common and the current escape patterns could be possibly standardized/simplified)
  • Add more features:
    • Predefined Linguistic selectors (will be presenting this idea in the next meeting)
    • Improved list support
    • Better currency support
    • More flexible formats (extendable inline?)
    • Numbers to "written numbers" converter?
    • Inflections (genders, articles, declensions, etc.)

nbouvrette avatar Jan 12 '20 21:01 nbouvrette

Can we keep the language used to retrieve the UI strings separate from the language/locale used to format variables/placeholders within a string? This would be consistent with how some OSs and some string formatting libs already separate UI language from locale formats.

MickMonaghan avatar Jan 13 '20 16:01 MickMonaghan

Perhaps Fluent already supports this?

Fluent does support it, it's called "partially formatted variables" and currency was the particular example that drove that feature.

The way it works in Fluent is this:

ctx.format('product-cost', {
  amount: FluentNumber(342, {
    currency: "JPY",
  })
});
// Translation can just use "default" formatting options
product-cost = This product costs { $amount }

// Or a translation can specify its own list of options (based on ECMA402 NumberFormat)

product-cost = This product costs { NUMBER($amount, minimumFractionDigits: 3) }

An important bit is that the function (NUMBER) limits which options can be provided by the translator: in the case of NUMBER, currency is not available for the localizer to specify.
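The split between developer-supplied and translator-supplied options can be sketched roughly as follows. This is a Python illustration of the idea, not Fluent's implementation: `PartialNumber`, `TRANSLATOR_ALLOWED` and `merge_options` are hypothetical names, and the allowed set is only an example.

```python
from dataclasses import dataclass, field

# Assumption: an example subset of options a translator may set from the message.
TRANSLATOR_ALLOWED = {"minimumFractionDigits", "maximumFractionDigits", "useGrouping"}


@dataclass(frozen=True)
class PartialNumber:
    """A value the developer has already partially formatted (e.g. fixed the currency)."""
    value: float
    options: dict = field(default_factory=dict)


def merge_options(arg, translator_options):
    """Start from the developer's options, then apply only the allowed translator options."""
    allowed = {k: v for k, v in translator_options.items() if k in TRANSLATOR_ALLOWED}
    return {**arg.options, **allowed}


amount = PartialNumber(342, {"currency": "JPY"})
# The translator's attempt to change the currency is dropped; the digits option is kept.
merged = merge_options(amount, {"minimumFractionDigits": 3, "currency": "USD"})
```

With this shape the currency is always decided at run time by the caller, while the translator can still tune presentation details.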

zbraniecki avatar Jan 13 '20 18:01 zbraniecki

Provide free tools with the syntax for authoring and translation (our own online CAT tool?)

Fluent comes with a CAT tool - https://github.com/mozilla/pontoon / https://pontoon.mozilla.org/ A lot of effort in Pontoon at the moment goes into better WYSIWYG for Fluent selectors.

Leave the syntax markup-agnostic (e.g. HTML) - the syntax should be able to accept HTML or any other markup, but the TMS and/or library can implement manipulation however it finds best for its use case

I'm not sure if I agree. Features like compound messages are important only when you look at the problem in the context of UI widgets. The drive to be agnostic may lead to a syntax that is not really optimized for anything. While I agree that we should ensure the syntax and data model are useful for a wide range of software use cases (and not, say, just for Web/React), having some "P1" targets would help us deliver something actually useful, imho. In particular, from my angle, understanding that software UI is not created by a bunch of imperative calls from JS/C/Java, but is usually defined in some declarative markup, is fundamental to how you design features. If we reject this hypothesis, it will have deep implications for what we end up with.

zbraniecki avatar Jan 13 '20 18:01 zbraniecki

I previously gave a presentation called Let's Come To An Agreement About Our Words. The presentation covers an older format that we used in Siri, and we're migrating to a newer simplified format. Here are some highlights of what it can do, or what we found desirable.

  • It's generally an XML format. The original would use something like ECMAScript/Java beans/UEL for referencing variables and its properties. The UEL syntax was too complicated and was changed to favor more XML with a nicer editor, much like your favorite word processor stores its data in XML without the end user really knowing that low level detail. It's also easier to interchange it with XLIFF when it's XML.
  • Support for SSML is very desirable for screen readers or virtual assistants.
  • The messages are by default both printable and speakable, but you can exclusively print or speak a phrase. If you ever need to explicitly speak a number within a given context, this is critical.
  • Word inflection and grammeme detection (values of grammatical categories) are fundamental parts of the syntax. It's critical functionality with user provided vocabulary. Generally, you need to know the grammatical number, grammatical case, the grammatical gender of the words and the pronunciation of the word (generally just if the word starts or ends with a vowel).
  • Word inflection can include adding prepositions, articles, pronouns or grammatical states of a given word. For complicated examples, check out Russian, Korean or Arabic.
  • Number pronunciation is provided by CLDR's RBNF.
  • Getting a number and noun into grammatical agreement is critical. The grammatical gender of the number comes from the noun. The grammatical number of the noun is generally affected by the value of the number (e.g. 1 or 2). The grammatical case is defined by the translator given the context of the sentence. The translator does not provide the exact inflections by default.
  • List handling involves inflecting each word. This might mean making each item the definite form.
  • The "and" (AKA conjunction) list, and the "or" (AKA disjunction) list are able to handle the context correctly for Italian, Spanish and Korean.
  • There is also the adjective list, which is probably the hardest to get correct for English. For Chinese and Korean, it's a lot easier.
  • There is a calendar concept based mostly on CLDR's translations. Some functionality is provided to add preposition or postpositions as needed. The grammatical case can be modified as needed. CLDR doesn't handle grammatical case modification that well by default.
  • There is a measurement concept that is separate from CLDR's implementation to provide precise translations of units of measure, like kilometers and miles. CLDR is more focused on the printable form instead of the speakable form, which is why CLDR is generally ignored when the speakable form is also needed.
  • It has a highly customized currency concept. CLDR only partially covers support for this functionality. Pronunciation of a currency for its units and subunits in native and foreign contexts is important.

This functionality works or is shipped on Linux, macOS, iOS, tvOS and watchOS. The watchOS support is probably the important thing to highlight because it is the most resource restrictive environment to support. I'm just stating that this functionality can live in resource constrained environments where grammatical correctness of a message is important.

grhoten avatar Jan 13 '20 19:01 grhoten

Can we keep the language used to retrieve the UI strings separate from the language/locale used to format variables/placeholders within a string? This would be consistent with how some OSs and some string formatting libs already separate UI language from locale formats.

While we definitely experienced a very vocal community of users of Firefox who want to use different translation from locale formats, this has also been a trap for regular users because date/time formats often contain translations.

For example, Chinese 2020年1月13日 星期一 下午12:03:10 or 星期一 下午12時 (for { weekday: "long", hour: "numeric" }) would be very confusing if placed in a sentence in a different locale.

There are even extreme cases. If the user had a German translation with a date formatted in en-US, there's a chance of flipping the MM/DD and DD/MM order. If the sentence is in German, the user has the right to interpret "05/08" using the German "DD/MM" pattern, and will be very surprised to learn later that it was actually en-US "MM/DD", taken from their OS locale formatting preferences.

My initial position is that we generally should, by default, format placeables (numbers, dates etc.) using the same locale as the translation is in, and allow the developer to provide an alternative language negotiation for formatters in order to handle exceptions like the ones you mentioned.

This is also important once we start talking about the error handling UX. Fluent has been designed to fallback using a locale chain, so if there's an error or missing string in the primary language, we'll fallback on the second best choice, rather than display an error and break the app. It's an important resilience measure for us. What's interesting is that that means that the locale chain used for formatters is per-bundle so that in the locale context ["fr-CA", "fr", "en"] we first try to localize a message in fr-CA using fr-CA formatters, but if there are errors and we end up localizing the message using en resources, we'll format the date/times using en locale.
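A minimal sketch of that fallback behaviour, in Python rather than Fluent's own API. The `Bundle` class, `format_message` and the sample messages are hypothetical; the only point is that the locale used for formatting placeables travels with the bundle that actually supplied the message.

```python
from dataclasses import dataclass


@dataclass
class Bundle:
    locale: str
    messages: dict


def format_message(bundles, message_id, **args):
    """Return (text, locale) from the first bundle that can format the message."""
    for bundle in bundles:
        pattern = bundle.messages.get(message_id)
        if pattern is None:
            continue  # missing string: fall back to the next locale
        try:
            return pattern.format(**args), bundle.locale
        except (KeyError, IndexError, ValueError):
            continue  # broken pattern: also fall back
    return message_id, None  # last resort: show the id rather than break the app


chain = [
    Bundle("fr-CA", {"hello": "Bonjour, {name} !"}),
    # "bye" in fr refers to a variable the caller never passes, so it fails.
    Bundle("fr", {"hello": "Bonjour, {name} !", "bye": "Au revoir, {nom} !"}),
    Bundle("en", {"hello": "Hello, {name}!", "bye": "Goodbye, {name}!"}),
]

hello = format_message(chain, "hello", name="Ana")
bye = format_message(chain, "bye", name="Ana")
```

The second element of the result is the locale a real implementation would hand to its date and number formatters, so a message that falls back to `en` is also formatted with `en` conventions.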

zbraniecki avatar Jan 13 '20 20:01 zbraniecki

@grhoten - this is awesome! Thank you for sharing!

We have some experience with TTS in the form of the Common Voice project, which uses Fluent.

While I don't see it in the translation resources they use now, I remember that in some variant of the project they used fluent's compound messages to represent the spoken/written difference:

time-is =
    .written = { $time }
    .spoken = The time is { $time }

It was an unexpected use of the compound messages, but brought up the idea that having message variants that are recognized as a single unit (with comments, invalidation rules, fallbacking together etc.) is important.

zbraniecki avatar Jan 13 '20 20:01 zbraniecki

Most OSes allow for a separation between the formatting locale and the resource locale, but it is not always explicit.

It is a really useful thing for regional variants. Most applications are localized into Spanish, French, Arabic, etc. Rarely is there a "flavor" like Spanish-Latin America.

But there are tens of countries using each of these languages, and they use different date / time / number formats.

So for the user it is best if one can use the French-Swiss locale (for example), and that will format things for fr-CH, but load the fr resources, with fallback.

If the fallback is granular enough (for instance on Android and Java it is string level) then one can have (for example) everything translated into French, and a document (or string) for fr-CH to cover country specific stuff (think legal, or special functionality)
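As a rough illustration of string-level fallback with a separate formatting locale, here is a small Python sketch. The `RESOURCES` table, `parent_chain` and `lookup` are made up for this example, and the parent chain is plain subtag truncation, which is simpler than what real platforms do.

```python
def parent_chain(locale):
    """'fr-CH' -> ['fr-CH', 'fr'] by simple truncation (not full CLDR parent rules)."""
    parts = locale.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


# Hypothetical resources: everything in "fr", one country-specific string in "fr-CH".
RESOURCES = {
    "fr": {"title": "Bienvenue", "legal": "Mentions légales"},
    "fr-CH": {"legal": "Mentions légales (Suisse)"},
}


def lookup(app_locale, key):
    """Return (string, resource_locale, format_locale) for one string."""
    for candidate in parent_chain(app_locale):
        table = RESOURCES.get(candidate, {})
        if key in table:
            # Strings may come from a parent locale; formatting stays with the app locale.
            return table[key], candidate, app_locale
    raise KeyError(key)
```

Here `lookup("fr-CH", "title")` takes the string from `fr` while still reporting `fr-CH` as the locale to format dates and numbers with.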

Not all systems have a way to tell that the strings really come from "fr". The "application locale" is fr-CH, and that is used for everything.

So you never get weird mixtures like French strings + German dates.

But I think that we should do better than to format using the same locale as the translation.

Not the same locale, but not 100% independent either.

I can explain how that works in Android, for example.

Cheers, Mihai

mihnita avatar Jan 13 '20 22:01 mihnita

About extended plurals, like:

{ count , plural ,
   =0 {No candy left}
  one {Got # candy left}
  <10 {Got a few candies left}
  10-20 {Got a handful candies left}
other {Got # candies left} }

I am quite reluctant about it. There is something similar in Java (ChoiceFormat). Example: "-1#is negative| 0#is zero or fraction | 1#is one |1.0<is 1+ |2#is two |2<is more than 2."

And it was a huge problem for proper localization. It was banned in most places I've been.

mihnita avatar Jan 13 '20 22:01 mihnita

"You owe {someNumber, number, currency}." - then the actual currency is inferred from the current locale - which is just nasty.

@MickMonaghan I agree. Actually, currency formatting that I've been involved with disallows this scenario. Currency formatting is a measured unit and not a number. The unit has to be explicitly defined outside of the current message.

grhoten avatar Jan 13 '20 22:01 grhoten

I am quite reluctant about it.

I agree with @mihnita. Such translations are rejected by the Mozilla L10n Drivers and the logic we use is that this is not a plural-based variant of the same string, but a set of separate strings, and which one to use should depend on some other selector than a localizer trying to build a selection like in the example. We documented that recommendation in https://github.com/projectfluent/fluent/wiki/Good-Practices-for-Developers#prefer-separate-messages-over-variants-for-ui-logic
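One way to read that recommendation, sketched in Python with hypothetical message ids: the ranges from the candy example become application logic that picks between separate messages, and only the genuinely plural-dependent message keeps plural variants. The plural rule below is a deliberately simplified English one for whole numbers.

```python
# Hypothetical messages: separate strings for UI states, plural variants only where needed.
MESSAGES = {
    "candy-none": "No candy left",
    "candy-few": "Got a few candies left",
    "candy-handful": "Got a handful of candies left",
    "candy-count": {
        "one": "Got {count} candy left",
        "other": "Got {count} candies left",
    },
}


def english_plural_category(count):
    """Simplified English rule for whole numbers only."""
    return "one" if count == 1 else "other"


def candy_message(count):
    """The developer, not the localizer, decides which message applies."""
    if count == 0:
        return MESSAGES["candy-none"]
    if 1 < count < 10:
        return MESSAGES["candy-few"]
    if 10 <= count <= 20:
        return MESSAGES["candy-handful"]
    variants = MESSAGES["candy-count"]
    return variants[english_plural_category(count)].format(count=count)
```

Each message can then be translated on its own, and a locale with different plural categories only has to vary `candy-count`.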

zbraniecki avatar Jan 13 '20 23:01 zbraniecki

About editors for developers / translators: I would rather have a standard mapping to XLIFF for translators. It would work better with the existing tools, instead of forcing translators to "get out" of their existing tools, edit somewhere else, then bring the string back in (usually with copy/paste), and do that every time one needs to fix something.

Similar with developers: it is better to provide plugins for existing IDEs (Eclipse, Intellij, Visual Studio Code) than a standalone editor. And we don't need to write those plugins ourselves.

mihnita avatar Jan 13 '20 23:01 mihnita

Some extra bullets to the wish list. I've tried to not add things already listed, but I am not sure I managed 100%.

  • Support the union of the functionality of both Fluent and MessageFormat, even if the final syntax looks like neither.
  • Plural / select / ordinal (more?) should apply to the full messages, not fragments (which is usually bad i18n)
  • Need the ability to add metadata for messages AND placeholders.
  • Allow parameters to get metadata from translators or from automated systems. For example, if a message has a parameter with 10 possible variants (from resources), a translator (or a "service") might be able to add a piece of metadata saying that this is a "noun, singular, masculine". Kind of related to inflections, but not really. I think I need to add more info on this.
  • Ability to protect sections of the message
  • Open / close / standalone placeholders, and flags for placeholders. See canCopy / canDelete / canOverlap in the XLIFF 2.1 spec (http://docs.oasis-open.org/xliff/xliff-core/v2.1/os/xliff-core-v2.1-os.html). This might overlap a bit with the html support, but it is a bit more generic.
  • Design things to be very modular.

Some thoughts on modularity:

  • There should be a "resource manager" that loads messages and deals with language negotiation, fallback, etc. That would be "an interface", with a default implementation, but developers should be able to provide their own. That deals not only with strings, but also other localizable resources (sound, video, images, styles, etc). It also works with the MessageFormat to load referred messages.
  • The syntax of the arguments passed to an API should be specified separately from the full storage format
  • It would be nice if the syntax used without an API (for binding) were very close to the one used with the API.

mihnita avatar Jan 13 '20 23:01 mihnita

  • No positional variables. No foo {0} bar and so on. All variables should have ids, both to improve readability and error recovery.
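To illustrate the error-recovery side of that point, here is a small Python sketch. The `{name}` placeholder syntax and the `format_named` helper are assumptions made for this example, not a proposed syntax.

```python
import re

# Assumption: placeholders look like {name}, with word characters only.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def format_named(pattern, args):
    """Substitute {name} placeholders; return (text, list of missing names)."""
    missing = []

    def replace(match):
        name = match.group(1)
        if name in args:
            return str(args[name])
        missing.append(name)
        return "{" + name + "}"  # keep the placeholder visible instead of failing

    return PLACEHOLDER.sub(replace, pattern), missing


text, missing = format_named("Moved {file} to {folder}", {"file": "a.txt"})
```

With ids, a missing argument can be reported by name ("folder") and a translator reordering the sentence cannot accidentally swap arguments, which is harder to guarantee with `{0}` and `{1}`.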

zbraniecki avatar Jan 14 '20 00:01 zbraniecki

Great initiative and for me perfect timing given I've been implementing functions for CLDR over the last 2 years (not based upon ICU). Reflecting on that experience and the great comments from here from people who have vastly more experience than me, I offer the following thoughts on requirements:

TL;DR (summary of thoughts)

  • Focus on a standard message format that can be expressed in at least standardised string and HTML formats

  • Use a format that has at least a good chance that the UI designer, the developer and the translator can grasp

  • Include an interchange format as part of the spec, which facilitates sharing and integration with tooling, but don't include a storage representation in the spec. This will be an important part of driving adoption.

Problem domain

  • The WG is called "Message Format". Taking that spirit it would seem the shared domain of interest, irrespective of development language or deployment platform, is defining a canonical format for localisable messages. The API for such messages would, it seems to me, be an implementation detail outside the scope of the WG.

  • The purpose of messages is to express common intent between a UI designer, a developer, a translator and a user. So irrespective of the representation (or representations) chosen, to the extent possible, reading the message in the code should convey intent that is largely understandable by all stakeholders (ok, not the user).

  • It would also seem in scope to define a standard interchange format (see below). Development and runtime environments vary a lot but each benefits from sharing data and integrating into CAT and other tooling.

Format representations

There are at least three representations useful for messaging I can see:

  • Storage representation. Think .pot files which, despite being gettext oriented, do appear to vaguely recognise other messaging formats. But really, this is the typical resource bundle in some format appropriate to the development and runtime environments. I would propose this is not in scope for the WG since there will be a lot of variability. One representation I'm working on doesn't even have a static resource bundle but has updatable translations via websockets (server-side orientation)

  • Interchange representation. The canonical representation that can be shared amongst all implementations of whatever comes out of the WG. Arguably this is one of the reasons that gettext has strong adoption - a common file format that has a lot of tool support. XLIFF 2.1 would appear to be a strong candidate since it has a formal structure and specification and is supported by CAT tools. But it isn't (by design) easy to consume for UI experts or translators.

  • Source code representation(s). I see comments here mostly around string-based and HTML-based representations, which makes a lot of sense. In each and any case I would like to see a format that is not whitespace-sensitive. The reason is that eventually some tooling has to decide whether message a is just a transformation of message b, and a common approach is hashing the message. In this case a canonical format of the message is required so that hashing is consistent. And that's hard to do if the format is whitespace-sensitive (as the current ICU message format is).
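The hashing argument can be made concrete with a short Python sketch. It assumes a format where whitespace is not significant, so collapsing runs of whitespace is a safe canonical form; `canonicalize` and `message_hash` are illustrative names only.

```python
import hashlib
import re


def canonicalize(message):
    """Collapse runs of whitespace and trim; only valid if whitespace is not significant."""
    return re.sub(r"\s+", " ", message).strip()


def message_hash(message):
    """Hash the canonical form so that reformatting the source does not change the id."""
    return hashlib.sha256(canonicalize(message).encode("utf-8")).hexdigest()


same = message_hash("You have  {count}\n   messages") == message_hash("You have {count} messages")
different = message_hash("You have {count} messages") != message_hash("You have {count} emails")
```

In a whitespace-sensitive format this normalization step is not available, because two messages that differ only in spacing may genuinely render differently.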

Relationship to CLDR

I see several comments reflecting that message formatting in some areas would benefit from enhancing CLDR data. Formatting units of measure is a good example. Without building an unreasonable dependency, making recommendations to CLDR would be very useful in advancing the overall i18n and l10n world.

kipcole9 avatar Jan 14 '20 02:01 kipcole9

There is one additional thing to mention. Good pronoun handling is hard to do. Arabic is by far the hardest to do. You morphologically attach a suffix to the given user vocabulary, which isn't trivial string concatenation. You need to know the gender of the pronoun subject, the grammatical number (singular, dual or plural), and you need to know the type of pronoun (e.g. possessive, reflexive and so on). Without this, Arabic speakers have to rewrite translations in less natural grammar.

Hebrew also has to know the gender of the people being referenced.

In German and Russian, it's more about the gender of the noun instead of the person being referenced.

You can also get into issues with how people want to be referred to. A person may not like a pronoun involving "him", "himself", "her" or "herself" for various reasons.

I have yet to see really good pronoun handling. It's hard to get correct.

grhoten avatar Jan 14 '20 07:01 grhoten

This is a use case but has requirement implications in terms of specification:

  • Enable round trip through XLIFF (and possibly other localization formats). In other words, there should be a well-defined way to convert between such a message format and XLIFF.

Example: http://docs.oasis-open.org/xliff/xliff-core/v2.1/os/xliff-core-v2.1-os.html#dataref

This example from the XLIFF spec shows a message in some kind of message format, "Error in {0}.", converted to XLIFF:

<unit id="1">
  <originalData>
    <data id="d1">{0}</data>
  </originalData>
  <segment>
    <source>Error in '<ph id="1" dataRef="d1"/>'.</source>
    <target>Erreur dans '<ph id="1" dataRef="d1"/>'.</target>
  </segment>
</unit>
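The reverse direction of such a round trip could look roughly like the Python sketch below, which puts the original placeholder data back into a segment. It only handles the `<ph dataRef="...">` case from the example above, ignores XML namespaces, and `restore` is a hypothetical helper rather than part of any XLIFF library.

```python
import xml.etree.ElementTree as ET

UNIT = """<unit id="1">
  <originalData>
    <data id="d1">{0}</data>
  </originalData>
  <segment>
    <source>Error in '<ph id="1" dataRef="d1"/>'.</source>
    <target>Erreur dans '<ph id="1" dataRef="d1"/>'.</target>
  </segment>
</unit>"""


def restore(unit_xml, which="target"):
    """Rebuild the message string by replacing each <ph> with its original data."""
    unit = ET.fromstring(unit_xml)
    data = {d.get("id"): (d.text or "") for d in unit.iter("data")}
    element = unit.find(f"./segment/{which}")
    parts = [element.text or ""]
    for child in element:
        if child.tag == "ph":
            parts.append(data[child.get("dataRef")])
        parts.append(child.tail or "")  # text that follows the placeholder
    return "".join(parts)


source_message = restore(UNIT, "source")
target_message = restore(UNIT, "target")
```

A well-defined mapping would need the same kind of rule for every construct in the message format (selectors, paired markup, metadata), not just standalone placeholders.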

srl295 avatar Jan 14 '20 22:01 srl295