lingua-franca icon indicating copy to clipboard operation
lingua-franca copied to clipboard

Implement support for getting plural categories

Open filips123 opened this issue 3 years ago • 11 comments

Description

This implementes support for getting plural categories and prepared for future full pliralization functions. It also changes how _translate_word works to make it possible to get correct plural forms in nice_duration.

I added those functions to new utils module, because I think they don't fit completely into existing parse and format. If you think adding them to existing modules would be better, I can move them. Because of #160 I added empty utils modules for for all other languages.

Format of translations in res/text is now changed to use JSON files that contain all translations for all plural forms. However, I kept existing format in most languages (except English and Slovenian) and added "legacy" translation function. That languages need to be converted in the future and legacy function can be deleted.

Currently the only languages that implement plural categories are English and Slovenian, but there is a fallback to English-style pluralization for all other languages for backwards compatibility.

Type of PR

  • [ ] Bugfix
  • [x] Feature implementation
  • [ ] Refactor of code (without functional changes)
  • [ ] Documentation improvements
  • [ ] Test improvements

Testing

There are tests for English and Slovenian language.

Documentation

I added/fixed docstrings where needed and created documentation in readme and project structure files.

filips123 avatar Dec 16 '20 09:12 filips123

i think this belongs in format module, dont think we should create a new utils submodule for this

essentially if it outputs text -> formatting if i receives text and extracts info from it -> parsing

this clearly belongs in the format module

i would also prefer if this came from a single dict, if we create a .json file for each category it quickly gets out of hand, connectors.json only has 2 entries for example.... I would suggest this becomes a simple pluralization.json file, if you feel the need add categories as top level keys inside that single json file

a .json file also does not scale well, so logic should be available, but thats for another future PR #37 #36

JarbasAl avatar Dec 16 '20 09:12 JarbasAl

I moved functions into format module and all translations into translations.json.

However, there is some problem. Some tests in test/test_format_sl.py fail because it looks like lingua-franca does not detect get_plural_category_sl in lingua_franca/lang/format_sl.py. This also happens locally, but if I just run that specific file with pytest test/test_format_sl.py, everything works.

I also added encoding="utf8" to all file open calls, because Python by default uses system encoding which is not always UTF-8 (for example on Windows) so tests can fail.

filips123 avatar Dec 17 '20 18:12 filips123

There's some boilerplate at the top of the other tests, which is required. Your tests pass once you paste it in there. It's this part:

class TestPronounceNumber(unittest.TestCase):
    def setUp(self):
        self.old_lang = get_default_lang()
        set_default_lang("sl-si")


    def tearDown(self):
        if self.old_lang:
            set_default_lang(self.old_lang)

Basically, there's no way to guarantee the order in which pytest will run tests. In order to be absolutely, 100% certain that the language you're testing is the default language at the moment the test runs, we set and then reset it in each test. That is, when a given test class is finished, the default language goes back to what it had been before the test ran. It's not a perfect solution, but it keeps the paradigm from breaking Pytest.

Hence, at the moment, with that boilerplate missing, your new tests are trying to run with some other default language, one in which the function really hasn't been localized.

In a perfect world, we'd just unload all languages after every test, so that this problem would throw a slightly more straightforward "no language module loaded." However, that didn't resolve the problem when I was writing the localizer. I should revisit that...

ChanceNCounter avatar Dec 17 '20 18:12 ChanceNCounter

Looks like some of the English and Slovenian res files got nuked when you renamed that file (hello, fellow regex person! 😁)

ChanceNCounter avatar Dec 17 '20 21:12 ChanceNCounter

nitpick, can translations.json be renamed to something more explicit? maybe pluralizations.json?

JarbasAl avatar Dec 18 '20 12:12 JarbasAl

Now having more time to review, I'm not sure pluralizations is a better name, because it now also houses the singlar forms of critical vocabulary.

Perhaps something like common_words.json or similar (I'm not exactly married to that, either =P) and the localization docs can just make it explicit that this contains important vocabulary singular and plural forms.

However, (@JarbasAl @jmontane) I wonder what implications this approach might have down the road for languages that need the "variant" framework.

ChanceNCounter avatar Dec 20 '20 23:12 ChanceNCounter

@JarbasAl bump re: compatibility with variants, possible need for integration (plural variants?)

ChanceNCounter avatar Dec 30 '20 22:12 ChanceNCounter

Let's chime in.

As is you'd need to pass an english singular to get the localized plural. I guess there are some good cross-language arguments. Lets say you want to pluralize some skill vocabulary (native language) - you're out of luck

The local function isn't used widely, so not a deep code. Why not step away from _translate_word? i get it, as a mycroft-core component, but LF?

Please tell me if i'm missing something.

emphasize avatar Jul 13 '22 21:07 emphasize

This PR has been inactive for quite some time, so I kinda forgot about it, but I think it would still be useful. I think one of the things that weren't determined yet was how to name pluralizations.json/translations.json. If that can be resolved and maintainers think the PR would be useful, I can try to rebase it and resolve conflicts.

The local function isn't used widely, so not a deep code. Why not step away from _translate_word?

_translate_word is only used internally from nice_duration and join_list to translate (and correctly pluralize) a few words that are used in these functions. It is not meant to be used by external skills.

A similar function to translate words already exists (and has been renamed to _translate_word_legacy in this PR), but only supports simple singular/plural form, so it does not work for languages with more complex plural forms. With this PR, _translate_word should now work with most languages, using Unicode CLDR Plural Rules.

Lets say you want to pluralize some skill vocabulary (native language)

In this case, you should use get_plural_category. You will get Unicode CLDR plural category and will then need to handle it according to your language's rules. You would probably have a dictionary of plural forms for all words and languages you need, and then just lookup it using category you get from get_plural_category.

There is also get_plural_form function, which is meant to get the correct plural form of any (native) word. However, because proper pluralization for arbitrary words is hard to do, it probably won't actually be implemented for any language for quite some time. So, you should probably use get_plural_category and do pluralization yourself, only for words you actually need.

i get it, as a mycroft-core component, but LF?

If you meant why _translate_word is in LF, it's because it's only an internal function used by other LF functions, so it wouldn't make sense it's in mycroft-core.

If you meant why get_plural_category and get_plural_form are in LF, it's because I think pluralization counts under "multilingual text parsing and formatting".

filips123 avatar Jul 14 '22 11:07 filips123

Hey there, I haven't looked at this deeply but definitely agree with the premise.

We're very close to launching the Mark II so keeping our focus quite specifically on that, but I'd like to come back to this once we hit code freeze on the Mark II.

I won't try to present an opinion on the questions raised until I've properly digested this. I would however say that I'm not opposed to more modules in LF beyond parse and format. Where the functions from this PR live is another question but very open to a util module if that makes more sense for this.

krisgesling avatar Jul 15 '22 00:07 krisgesling

I approve these changes and cherry picked these commits for ovos, so far i only did minor changes like adding an enum and porting the old singularize/pluralize prs for english/portuguese, but i still want to change a few more things like renaming "type" to something else to avoid the namespace collision with the builtin type etc.

might be interesting to keep an eye on https://github.com/OpenVoiceOS/ovos-lingua-franca/pull/28 , for what its worth i think this PR is looking great already!

JarbasAl avatar Jul 15 '22 02:07 JarbasAl