message-format-wg
Support for inflections (cases)
This thread is a spin-off of the conversation that began in requirements gathering (issue #3) about what good solutions would look like for better supporting inflections (genders, articles, declensions, etc.).
To better understand inflections, please listen to the video by @grhoten: Let's Come To An Agreement About Our Words
See previous comments:
- First question from @nbouvrette
- Answer from @zbraniecki
- Second question from @nbouvrette
- Answer from @mihnita
Reference material:
I'd like to kick off this thread maybe with a few questions which I am sure some of you probably have ideas around already:
- Do we have clear examples of how inflection syntaxes are used today and how common/useful solving this problem can be? There have been a few mentions that this problem is quite large and might not be used widely - it would be good to start by clarifying this point!
- Do we know how large this problem is and how much work it would take to solve holistically (presuming we focus on top languages)?
- What inflection problems can be solved today and which ones still remain to be solved?
- Can we see a few examples in different syntax on how most common examples can be solved today and the pros and cons of each approach?
- How do existing approaches fit into the current TMS/CAT landscape?
When it comes to word inflection, it's primarily important when encountering user vocabulary in an entire sentence. If you're just doing a label and field UI, it's unimportant. Here are some examples:
Your ${device} is on.
- In Arabic and Hebrew, "your" needs to be inflected depending on the gender of who you are. Otherwise you have to rephrase it in a less natural way. The pronoun also morphologically attaches to the device variable in Arabic. It's a bound morpheme that is not separated by whitespace.
- In English, the "is" needs to be inflected depending on whether the device is singular or plural. It could be "Your light is on" or "Your lights are on". Though I can get more complicated with this example in English.
- In French, the "on" depends on the grammatical gender of the variable named device.
- In some languages, they may want to make the device variable definite.
- This assumes that the "device" variable is defined by the user and maybe the application. If the set of values from the device variable was bounded, then you would split the message with hard coded values for the variable.
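To make the English case above concrete, here is a minimal sketch of picking "is" or "are" from the grammatical number of the device value. The `Noun` type, the `LEXICON` table, and `device_is_on` are hypothetical names invented for illustration; in practice the grammatical data would have to come from the user, the application, or an inflection engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    """A user-supplied term plus the grammatical number the message needs."""
    text: str
    plural: bool


# Hypothetical lexicon (an assumption for this sketch): real data would be
# supplied by the user, the application, or an inflection engine.
LEXICON = {
    "light": Noun("light", plural=False),
    "lights": Noun("lights", plural=True),
}


def device_is_on(device: str) -> str:
    """Format 'Your ${device} is on.' with verb agreement in English."""
    noun = LEXICON[device]
    # The verb agrees with the grammatical number of the variable.
    verb = "are" if noun.plural else "is"
    return f"Your {noun.text} {verb} on."


print(device_is_on("light"))
print(device_is_on("lights"))
```

This only covers the English verb. The Arabic, Hebrew, and French points above would need additional properties (gender of the user, gender of the noun, definiteness) carried alongside the term.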
${number} ${item} were found.
- This also assumes that the "item" variable is defined by the user and maybe the application. If the set of values from the item variable was bounded, then you would split the message with hard coded values for the variable.
- The item variable will need to be inflected to the plural form. For Russian and Arabic, this is a complex table because case will need to be involved too.
- The "were" depends on the value of number in addition to the item variable.
- If the "number" variable needs to be pronounced (turning the digits into words), then you likely need to use RBNF (rule based number format). The number typically needs to agree with the case and gender of the noun. The grammatical number of the noun needs to agree with the value of the number.
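As a rough illustration of the number and noun agreement described above, the sketch below selects a Russian noun form from the count. It assumes non-negative integers only and hand-codes the integer part of the CLDR plural rules for Russian; the function names and the `FILE_FORMS` table are hypothetical. The forms shown are the ones used when the phrase is in the nominative; other cases need a larger table, which is the complexity mentioned above.

```python
def russian_plural_category(n: int) -> str:
    """Return the CLDR plural category for a non-negative integer in Russian.

    Simplified: fractions (which map to 'other' in CLDR) are not handled.
    """
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    mod10, mod100 = n % 10, n % 100
    if mod10 == 1 and mod100 != 11:
        return "one"
    if 2 <= mod10 <= 4 and not 12 <= mod100 <= 14:
        return "few"
    return "many"


# Hypothetical per-noun table: forms of "файл" (file) after a number,
# for a phrase in the nominative case.
FILE_FORMS = {"one": "файл", "few": "файла", "many": "файлов"}


def count_phrase(n: int) -> str:
    """Join a number with the noun form that agrees with it."""
    return f"{n} {FILE_FORMS[russian_plural_category(n)]}"


for n in (1, 2, 5, 11, 21, 22):
    print(count_phrase(n))
```

A user-defined item would need its own table of forms, which is why a bounded set of values is so much easier to handle than open vocabulary.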
I can go on for a while on this topic for how Siri solved it. If you need more precise examples, then it might be best if I did a presentation about it.
@grhoten very good examples - I would definitely like a presentation on this topic and I am sure others would also appreciate learning more about how you solved this problem. What do you think would be the best way to organize this?
Also, from your experience, do you feel there is a lot of need around solving this problem, or is this more of a niche that most companies might not need in their toolsets? I still believe there are probably common inflections that could be useful for the general public (e.g. the indefinite article in English is one I am guessing could be used a lot).
Just a fun video. Not really about inflections, but just try to think: how would I handle this: https://www.youtube.com/watch?v=YY9qjqMdUDs :-)
I think your video reinforces why inflection handling should be done by language pairs :) Also, Nahuatl is only spoken by 1.5 million people worldwide, so it would probably end up pretty low on the backlog in terms of which rules we would support first.
That video was for fun, but the point was not about Nahuatl.
In the same "bucket" you have Finnish, Hungarian, Turkish, Georgian, Japanese: https://en.wikipedia.org/wiki/Synthetic_language#Relational_synthesis
Wikipedia also lists Spanish and Italian there, but they have it to a lesser degree. Romanian works the same as Italian (with worse inflection :-)
I wonder if we should try to stack rank the linguistic challenges by how common they are and prioritize them accordingly. I could spot a few linguistics-related threads so far (it could be good to tag them as well):
- Support list handling
- Text transformations (e.g. title case, capitalize first letter, lower case)
- Provide a way to get a number and noun into grammatical agreement.
Also improving existing features:
As mentioned in today's telecon (2023-09-18), closing old requirements issues.