Support for inflections (cases)

Open nbouvrette opened this issue 4 years ago • 7 comments

This thread is a spin-off of the conversation that began in requirements gathering (issue #3) about what would be good solutions to offer better inflections (genders, articles, declensions, etc.) support:

To better understand inflections, please listen to the video by @grhoten: Let's Come To An Agreement About Our Words

See previous comments:

Reference material:

nbouvrette avatar Jan 25 '20 20:01 nbouvrette

I'd like to kick off this thread maybe with a few questions which I am sure some of you probably have ideas around already:

  1. Do we have clear examples of how inflection syntaxes are used today and how common/useful solving this problem can be? There have been a few mentions that this problem is quite large and that a solution might not be widely used - it would be good to start by clarifying this point!
  2. Do we know how large this problem is and how much work it would take to solve holistically (presuming we focus on top languages)?
  3. What inflection problems can be solved today and which ones still remain to be solved?
  4. Can we see a few examples in different syntaxes of how the most common cases can be solved today, and the pros and cons of each approach?
  5. How do existing approaches fit into the current TMS/CAT landscape?

nbouvrette avatar Jan 25 '20 20:01 nbouvrette

Word inflection matters primarily when user vocabulary appears inside an entire sentence. If you're just building a label-and-field UI, it's unimportant. Here are some examples:

Your ${device} is on.
  • In Arabic and Hebrew, "your" needs to be inflected depending on the gender of who you are. Otherwise you have to rephrase it in a less natural way. The pronoun also morphologically attaches to the device variable in Arabic. It's a bound morpheme that is not separated by whitespace.
  • In English, the "is" needs to be inflected depending on whether the device is singular or plural. It could be "Your light is on" or "Your lights are on". I could make this example more complicated even in English.
  • In French, the "on" depends on the grammatical gender of the variable named device.
  • In some languages, they may want to make the device variable definite.
  • This assumes that the "device" variable is defined by the user and maybe the application. If the set of values from the device variable was bounded, then you would split the message with hard coded values for the variable.
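The English bullet above can be sketched in a few lines. This is only an illustration, not how any existing library works: the `Term` class, its `number` field, and the `DEVICE_ON` table are hypothetical names invented here, and the sketch assumes the application already knows the grammatical number of the user-supplied word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A user-supplied word plus the grammatical features a message needs.

    Hypothetical structure; a real lexicon would carry more features
    (gender, case, definiteness) for other languages.
    """
    text: str
    number: str  # "singular" or "plural"


# Hypothetical pattern table: one message variant per grammatical number
# of the argument, so the verb agrees with the inserted word.
DEVICE_ON = {
    "singular": "Your {device} is on.",
    "plural": "Your {device} are on.",
}


def format_device_on(device: Term) -> str:
    """Pick the variant whose verb agrees with the device term."""
    return DEVICE_ON[device.number].format(device=device.text)


print(format_device_on(Term("light", "singular")))
print(format_device_on(Term("lights", "plural")))
```

Selecting on a feature of the argument keeps the translator in control of the whole sentence. The French and Arabic bullets would need additional features (gender, bound pronouns) that this sketch does not attempt.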
${number} ${item} were found.
  • This also assumes that the "item" variable is defined by the user and maybe the application. If the set of values from the item variable was bounded, then you would split the message with hard coded values for the variable.
  • The item variable will need to be inflected to the plural form. For Russian and Arabic, this is a complex table because case will need to be involved too.
  • The "were" depends on the value of number in addition to the item variable.
  • If the "number" variable needs to be pronounced (turning the digits into words), then you likely need to use RBNF (rule-based number format). The number typically needs to agree with the case and gender of the noun. The grammatical number of the noun needs to agree with the value of the number.
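As a rough English-only sketch of the agreement described above, the number can drive both the noun form and the verb. Everything here is an assumption for illustration: `LEXICON`, `VERB`, `plural_category_en`, and `format_found` are hypothetical names, the lexicon stands in for user-defined vocabulary, and the plural rule only covers English integers ("one" for 1, "other" otherwise). Russian or Arabic would need more plural categories and case as well.

```python
# Hypothetical lexicon of user vocabulary: each item lists its form
# for every English plural category.
LEXICON = {
    "light": {"one": "light", "other": "lights"},
    "match": {"one": "match", "other": "matches"},
}

# The verb must agree with the same category as the noun.
VERB = {"one": "was", "other": "were"}


def plural_category_en(n: int) -> str:
    """English plural category for integers only: 1 is 'one', all else 'other'."""
    return "one" if n == 1 else "other"


def format_found(number: int, item: str) -> str:
    """Inflect the item and the verb so both agree with the number."""
    category = plural_category_en(number)
    noun = LEXICON[item][category]
    return f"{number} {noun} {VERB[category]} found."


print(format_found(1, "light"))
print(format_found(3, "light"))
print(format_found(0, "match"))
```

The point of the sketch is that one value (the number) determines two other words in the sentence, which is why a plain placeholder substitution is not enough.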

I can go on for a while on this topic for how Siri solved it. If you need more precise examples, then it might be best if I did a presentation about it.

grhoten avatar Jan 27 '20 19:01 grhoten

@grhoten very good examples - I would definitely like a presentation on this topic and I am sure others would also appreciate learning more about how you solved this problem. What do you think would be the best way to organize this?

Also, from your experience, do you feel there is a lot of need for solving this problem, or is this more of a niche that most companies might not need in their toolsets? I still believe there are probably common inflections that could be useful for the general public (e.g. the indefinite article in English is one that I am guessing could be used a lot).

nbouvrette avatar Feb 01 '20 13:02 nbouvrette

Just a fun video. Not really about inflections, but just try to think: how would I handle this: https://www.youtube.com/watch?v=YY9qjqMdUDs :-)

mihnita avatar Feb 15 '20 01:02 mihnita

I think your video reinforces why inflection handling should be done by language pairs :) Also, Nahuatl is only spoken by 1.5 million people worldwide, so I would think it would end up pretty low on the backlog in terms of which rules we would support first.

nbouvrette avatar Feb 16 '20 14:02 nbouvrette

That video was for fun, but the point was not about Nahuatl.

In the same "bucket" you have Finnish, Hungarian, Turkish, Georgian, Japanese: https://en.wikipedia.org/wiki/Synthetic_language#Relational_synthesis

Wikipedia also lists Spanish and Italian there, but they have it to a lesser degree. Romanian works the same as Italian (with worse inflection :-)

mihnita avatar Feb 17 '20 19:02 mihnita

I wonder if we should try to stack rank which linguistic challenges are the most common and how to prioritize them. I could spot a few linguistics-related threads so far (it could be good to tag them as well):

Also improving existing features:

nbouvrette avatar Feb 18 '20 13:02 nbouvrette

As mentioned in today's telecon (2023-09-18), closing old requirements issues.

aphillips avatar Sep 18 '23 19:09 aphillips