message-format-wg

File Format

Open nbouvrette opened this issue 4 years ago • 9 comments

Is your feature request related to a problem? Please describe. To store strings, we need to support at least one file format. Message Format is currently agnostic of file formats, which means that some topics like context (see issues 39 and 40) should either be considered upfront or as a separate topic.

Describe the solution you'd like I would keep the new standard file format agnostic (but would be curious what others think).

Describe why your solution should shape the standard By not requiring a new file format I see the following benefits:

  • More flexibility (users pick what they prefer)
  • No need to develop yet another file format while we have so many linguistic issues to resolve
  • New file formats can also be done as a separate effort and the new syntax could use it (I don't think there is a clear dependency for me)
  • No need to have tools and TMSes support the new format
  • Ultimately faster and easier adoption

Additional context or examples Most TMSes already support 20+ file formats, and while I agree that Fluent's current file format seems like the most appropriate for localization, wide adoption can take time and other formats are already widely used. There are already a lot of formats that can support context and that are well supported across languages. The most popular formats I have seen:

  • .properties - simple key values with comments that can be used as context
  • JSON - very powerful, and sometimes hard to integrate with TMSes depending on the complexity of the schema. Context can also be embedded.
  • XML - same as JSON but probably even more broadly supported since XML has existed for longer

nbouvrette, Feb 15 '20 17:02

I think there are 4 related concepts:

  1. The data model for a single message, possibly with some metadata.
  2. The syntax for a single message, e.g. MF's pattern strings.
  3. The data model for resources, defined as collections of messages.
  4. The syntax for resources.

When you say file format agnostic, do you mean (3) or (4)?
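To make the four concepts concrete, here is a minimal Python sketch. All names (`Message`, `Resource`, `parse_resource`) and the `key = pattern` line syntax are hypothetical, chosen only for illustration; nothing here is proposed by the working group.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """(1) Data model for a single message, with optional metadata."""
    pattern: str  # (2) the message body, written in some message syntax
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Resource:
    """(3) Data model for a resource: a collection of messages by key."""
    messages: dict[str, Message] = field(default_factory=dict)


def parse_resource(text: str) -> Resource:
    """(4) One possible resource syntax (an assumption for this sketch):
    'key = pattern' lines, with preceding '#' comments kept as context."""
    resource = Resource()
    pending_comment: list[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            pending_comment = []  # a blank line detaches earlier comments
            continue
        if line.startswith("#"):
            pending_comment.append(line[1:].strip())
            continue
        key, _, pattern = line.partition("=")
        metadata = {"context": " ".join(pending_comment)} if pending_comment else {}
        resource.messages[key.strip()] = Message(pattern.strip(), metadata)
        pending_comment = []
    return resource
```

Swapping `parse_resource` for a JSON or XML reader would change only (4); the `Message` and `Resource` classes, i.e. (1) and (3), would stay the same.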

stasm, Feb 17 '20 14:02

> When you say file format agnostic, do you mean (3) or (4)?

I am talking about the format of the file where you store the key/values (strings) which may or may not contain MF (or other) syntax - if I understand correctly this would be (4)?

But I'm starting to realize there might be a part missing in the conversation here (related to this thread as well).

From my perspective, there are two key concepts part of the localization chain - everything else is more on the implementation side:

  1. Storage of key/values (typically in a file, but this could also live in a database or other storage)
  2. Those key/values may or may not include syntax (in the value part, aka strings), but if they do, the parser of this syntax should be able to render them according to their design, with the correct data

In my view of the simplest solution, you can use any file format to store your strings and use the syntax or not, as you see fit. The syntax is simple and can be translated seamlessly using modern continuous localization platforms, without the need of import/export scripts.
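As a rough illustration of this separation, the sketch below loads the same key/value pair from two storage formats and renders it with one formatter. The loader names are hypothetical, the `.properties` reader is deliberately simplified (no escapes, no `:` separators, no line continuations), and Python's `string.Template` is only a stand-in for a real message syntax parser.

```python
import json
from string import Template


def load_json(text: str) -> dict[str, str]:
    """Storage option A: a flat JSON object of key/value strings."""
    return dict(json.loads(text))


def load_properties(text: str) -> dict[str, str]:
    """Storage option B: simplified .properties-style 'key = value' lines."""
    entries = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def render(messages: dict[str, str], key: str, **args: object) -> str:
    """The syntax step: independent of where the strings were stored."""
    return Template(messages[key]).substitute(**args)


from_json = load_json('{"greeting": "Hello, ${name}!"}')
from_properties = load_properties("# shown on the home page\ngreeting = Hello, ${name}!")
print(render(from_json, "greeting", name="Ada"))
print(render(from_properties, "greeting", name="Ada"))
```

Both calls go through the same `render` function; only the loader differs.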

This is the part I think I will try to clarify because I'm not sure how everyone is doing localization today, but to me, having conversion scripts, or build scripts could be avoided by keeping everything simple enough - if this is even possible with the extended scope.

All I can say is that this is already possible today :-)

nbouvrette, Feb 18 '20 12:02

From all I see in the discussions of various features, we seem to want to eat the cake and have it too:

  • "keeping everything simple enough"
  • "we want to support a lot of linguistic features" (inflections, concordance, etc.)

I like both of these bullets. But I don't think it is realistic to expect to have both :-) (note: what is "simple enough" for a developer is not at all the same for a linguist)

mihnita, Feb 18 '20 21:02

As far as I can see, the mix of requirements/features and the pairing of form and function must be kept whatever type or format of file we end up with.

I see a strict dependency on a well-defined Data Model, whose complexity will grow with the number of features we put in the bag.

I see tackling this part of the definition as very important; it will consequently drive the format or formats.

romulocintra, Feb 18 '20 23:02

Fair enough - the more I learn about languages the more I realize how much variety there is (funny example here: https://en.wikipedia.org/wiki/Boustrophedon)

But ultimately we are trying to build a useful solution. If the file type is not widely supported, or if the format is too complex, this could reduce the usefulness of the solution because of adoption barriers.

I think a good exercise would be to stack rank all the linguistic features we would need, each with a clear real-life use case; this way we will be well aware of why we would make certain trade-offs.

nbouvrette, Feb 22 '20 20:02

As an observation, based on the likely outcome of #103 to support top-level selectors with multiple input variables, this is going to create a desire for a human-accessible file format that can support lists of strings as item keys. As far as I know, the only widely-used format that does allow for that is YAML.

Reformatting some of the examples given there by @echeran, @mihnita and myself, here's how they could be expressed (using slightly variable selector function specifications):

plain-message: Do we allow multiple multi-select messages to nest inside one another?

profile-likes:
  select: [ PLURAL(friendsNum), PLURAL(countriesNum), GENDER(user) ]
  cases:
    [ one, one, masculine ]: ${friendsNum} friend from ${countriesNum} country liked his profile.
    [ one, one, feminine ]: ${friendsNum} friend from ${countriesNum} country liked her profile.
    [ one, one, other ]: ${friendsNum} friend from ${countriesNum} country liked their profile.
    [ one, other, masculine ]: ${friendsNum} friend from ${countriesNum} countries liked his profile.
    [ one, other, feminine ]: ${friendsNum} friend from ${countriesNum} countries liked her profile.
    [ one, other, other ]: ${friendsNum} friend from ${countriesNum} countries liked their profile.
    [ other, one, masculine ]: ${friendsNum} friends from ${countriesNum} country liked his profile.
    [ other, one, feminine ]: ${friendsNum} friends from ${countriesNum} country liked her profile.
    [ other, one, other ]: ${friendsNum} friends from ${countriesNum} country liked their profile.
    [ other, other, masculine ]: ${friendsNum} friends from ${countriesNum} countries liked his profile.
    [ other, other, feminine ]: ${friendsNum} friends from ${countriesNum} countries liked her profile.
    [ other, other, other ]: ${friendsNum} friends from ${countriesNum} countries liked their profile.

deleted-files:
  select: [ file_count:plural, dir_count:plural ]
  cases:
    [   =1,    =1]: You deleted one file in one folder!
    [   =1, other]: You deleted one file in ${dir_count} folders!
    [other,    =1]: You deleted ${file_count} files in one folder!
    [other, other]: You deleted ${file_count} files in ${dir_count} folders!

listed-items:
  select: count
  cases:
    one: Listing one item
    other: Listing ${count} items

My suspicion is that going with any other choice than YAML would require us to spend more time defining and building tooling for selector syntax, only to arrive at some custom solution that's going to look really similar. One limitation that YAML does impose is that plain unquoted scalars can't start with the { character, for which reason I'm using e.g. ${count} in the above.
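To show what "lists of strings as item keys" means for a consumer of such a file, here is a hedged Python sketch of the `deleted-files` message above, using tuples as dictionary keys. The matching order (exact `=N` keys before `other`) and the `format_message` helper are assumptions made for this example; the thread does not define a selection algorithm, and `string.Template` only stands in for real placeholder handling.

```python
from itertools import product
from string import Template

# The 'deleted-files' message from the YAML above, as plain Python data.
DELETED_FILES = {
    "select": ["file_count", "dir_count"],
    "cases": {
        ("=1", "=1"): "You deleted one file in one folder!",
        ("=1", "other"): "You deleted one file in ${dir_count} folders!",
        ("other", "=1"): "You deleted ${file_count} files in one folder!",
        ("other", "other"): "You deleted ${file_count} files in ${dir_count} folders!",
    },
}


def format_message(message: dict, args: dict) -> str:
    """Pick the first matching case, trying exact '=N' keys before 'other'."""
    candidates = [(f"={args[name]}", "other") for name in message["select"]]
    for key in product(*candidates):
        if key in message["cases"]:
            return Template(message["cases"][key]).substitute(args)
    raise KeyError(f"no case matches {args!r}")


print(format_message(DELETED_FILES, {"file_count": 1, "dir_count": 3}))
print(format_message(DELETED_FILES, {"file_count": 5, "dir_count": 1}))
```

A real implementation would derive categories such as `one` or `few` from locale-specific plural rules rather than the exact-number check used here.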

As a further observation, using an externally defined format like YAML can also be thought of as an argument for not allowing in-message selectors, given that we then don't need to define a syntax for them.

eemeli, Sep 28 '20 21:09

@Janpot & @ray007, could you expand a bit on why you've reacted to the above with a 👎? Is it that you don't think we should specify a file format, that there's a better alternative than YAML, or that you disagree with multi-value selectors being extracted from the source text of the message?

eemeli, Sep 30 '20 05:09

@eemeli I really dislike the syntax where one has to list all permutations of argument variations manually, it's so easy to miss some this way. And it duplicates a lot of text, which is another possible source of errors and makes it a pain to change. And the [ other, other, masculine ] part is the only thing not easily translatable to json without some extra magic.
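The risk of missing a permutation can at least be checked mechanically. The sketch below is one hypothetical way a linter could do that; the `missing_cases` helper and the category lists are assumptions for illustration and are not part of any proposal in this thread.

```python
from itertools import product

# Categories for each selector of 'profile-likes', in selector order.
CATEGORIES = [
    ("one", "other"),                    # PLURAL(friendsNum)
    ("one", "other"),                    # PLURAL(countriesNum)
    ("masculine", "feminine", "other"),  # GENDER(user)
]


def missing_cases(cases, categories):
    """Return every selector combination that has no message defined."""
    return [combo for combo in product(*categories) if combo not in cases]


# Simulate a translator forgetting one of the 12 permutations.
defined = set(product(*CATEGORIES)) - {("other", "one", "feminine")}
print(missing_cases(defined, CATEGORIES))
```

A check like this catches omissions, but it does not address the duplicated text across cases.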

ray007, Sep 30 '20 07:09

Since we target a data model, not a file format, this is probably out of scope.

And if we need to design something at this level, I would rather go with a syntax, not a file format. Or a syntax, and an optional (not mandatory) file format.

One of the main benefits of the "not-a-file-format" approach is that you can store the strings in whatever format is native for the tech stack used. Right now I can put a string with MessageFormat syntax anywhere I want: in a Java .properties file, in Android strings.xml, in iOS .strings, ancient Windows .rc (and in fact a lot of platforms do exactly that).

One might say: who cares, a file format is just a file format, you can mix and match.

But not so. The tech stack does language negotiation and fallback using certain rules (and they are different between platforms). To load a custom ("non-native") file format one would need to implement the exact same fallback logic in the custom loader, or you end up with inconsistent behavior. For example, in Android the resource loader is used to retrieve not only localized strings, but also localized images, sound, themes (styles), layouts, you name it.
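To give a sense of the logic a custom loader would have to duplicate, here is a deliberately simplified Python sketch of truncation-based fallback. It is not Android's, Java's, or any other platform's actual algorithm; `fallback_chain`, `lookup`, and the `"en"` default are all assumptions made for illustration.

```python
def fallback_chain(locale: str, default: str = "en") -> list[str]:
    """Truncate subtags from the right, then fall back to the default locale."""
    parts = locale.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if default not in chain:
        chain.append(default)
    return chain


def lookup(resources: dict[str, dict[str, str]], locale: str, key: str) -> str:
    """Return the first message found while walking the fallback chain."""
    for candidate in fallback_chain(locale):
        messages = resources.get(candidate, {})
        if key in messages:
            return messages[key]
    raise KeyError(key)


RESOURCES = {
    "pt": {"hello": "Olá"},
    "en": {"hello": "Hello", "bye": "Bye"},
}
print(fallback_chain("pt-BR"))
print(lookup(RESOURCES, "pt-BR", "hello"))
print(lookup(RESOURCES, "pt-BR", "bye"))
```

Real platforms differ in exactly these details (script and region handling, per-key versus per-file fallback), which is why a second loader with its own rules can behave inconsistently with the native one.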

Then you add the burden to either migrate all strings to the new file format (and update the code that loads the strings), or live with a mixture (most strings stay in strings.xml, but the plural, gender and inflection strings should be in strings.mf2)

This is why I don't even care about the file format issue. If we manage to agree on a data model (a hard enough task :-), I am more than happy to declare "100% mission accomplished". And if we agree on a string syntax (but not a file format), I would declare 130% mission accomplished :-)

mihnita, Oct 21 '20 20:10

I am closing this issue because we have resolved that resource (file) formats are out of scope. The ABNF includes some features (related to whitespace handling) to help implementations of various resource formats, but we're otherwise agnostic.

If you think this should be re-opened, please consider opening new issues with specific requests/requirements against the syntax or specification indicating what in-scope features are needed (e.g. to support various resource syntaxes).

aphillips, Jun 17 '23 15:06