lingua-franca
lingua-franca copied to clipboard
Implement support for getting plural categories
Description
This implementes support for getting plural categories and prepared for future full pliralization functions. It also changes how _translate_word
works to make it possible to get correct plural forms in nice_duration
.
I added those functions to new utils
module, because I think they don't fit completely into existing parse
and format
. If you think adding them to existing modules would be better, I can move them. Because of #160 I added empty utils
modules for for all other languages.
Format of translations in res/text
is now changed to use JSON files that contain all translations for all plural forms. However, I kept existing format in most languages (except English and Slovenian) and added "legacy" translation function. That languages need to be converted in the future and legacy function can be deleted.
Currently the only languages that implement plural categories are English and Slovenian, but there is a fallback to English-style pluralization for all other languages for backwards compatibility.
Type of PR
- [ ] Bugfix
- [x] Feature implementation
- [ ] Refactor of code (without functional changes)
- [ ] Documentation improvements
- [ ] Test improvements
Testing
There are tests for English and Slovenian language.
Documentation
I added/fixed docstrings where needed and created documentation in readme and project structure files.
i think this belongs in format module, dont think we should create a new utils
submodule for this
essentially if it outputs text -> formatting if i receives text and extracts info from it -> parsing
this clearly belongs in the format module
i would also prefer if this came from a single dict, if we create a .json file for each category it quickly gets out of hand, connectors.json only has 2 entries for example.... I would suggest this becomes a simple pluralization.json file, if you feel the need add categories as top level keys inside that single json file
a .json file also does not scale well, so logic should be available, but thats for another future PR #37 #36
I moved functions into format
module and all translations into translations.json
.
However, there is some problem. Some tests in test/test_format_sl.py
fail because it looks like lingua-franca does not detect get_plural_category_sl
in lingua_franca/lang/format_sl.py
. This also happens locally, but if I just run that specific file with pytest test/test_format_sl.py
, everything works.
I also added encoding="utf8"
to all file open calls, because Python by default uses system encoding which is not always UTF-8 (for example on Windows) so tests can fail.
There's some boilerplate at the top of the other tests, which is required. Your tests pass once you paste it in there. It's this part:
class TestPronounceNumber(unittest.TestCase):
def setUp(self):
self.old_lang = get_default_lang()
set_default_lang("sl-si")
def tearDown(self):
if self.old_lang:
set_default_lang(self.old_lang)
Basically, there's no way to guarantee the order in which pytest will run tests. In order to be absolutely, 100% certain that the language you're testing is the default language at the moment the test runs, we set and then reset it in each test. That is, when a given test class is finished, the default language goes back to what it had been before the test ran. It's not a perfect solution, but it keeps the paradigm from breaking Pytest.
Hence, at the moment, with that boilerplate missing, your new tests are trying to run with some other default language, one in which the function really hasn't been localized.
In a perfect world, we'd just unload all languages after every test, so that this problem would throw a slightly more straightforward "no language module loaded." However, that didn't resolve the problem when I was writing the localizer. I should revisit that...
Looks like some of the English and Slovenian res files got nuked when you renamed that file (hello, fellow regex person! 😁)
nitpick, can translations.json
be renamed to something more explicit? maybe pluralizations.json
?
Now having more time to review, I'm not sure pluralizations
is a better name, because it now also houses the singlar forms of critical vocabulary.
Perhaps something like common_words.json
or similar (I'm not exactly married to that, either =P) and the localization docs can just make it explicit that this contains important vocabulary singular and plural forms.
However, (@JarbasAl @jmontane) I wonder what implications this approach might have down the road for languages that need the "variant" framework.
@JarbasAl bump re: compatibility with variants, possible need for integration (plural variants?)
Let's chime in.
As is you'd need to pass an english singular to get the localized plural. I guess there are some good cross-language arguments. Lets say you want to pluralize some skill vocabulary (native language) - you're out of luck
The local function isn't used widely, so not a deep code. Why not step away from _translate_word
? i get it, as a mycroft-core component, but LF?
Please tell me if i'm missing something.
This PR has been inactive for quite some time, so I kinda forgot about it, but I think it would still be useful. I think one of the things that weren't determined yet was how to name pluralizations.json
/translations.json
. If that can be resolved and maintainers think the PR would be useful, I can try to rebase it and resolve conflicts.
The local function isn't used widely, so not a deep code. Why not step away from
_translate_word
?
_translate_word
is only used internally from nice_duration
and join_list
to translate (and correctly pluralize) a few words that are used in these functions. It is not meant to be used by external skills.
A similar function to translate words already exists (and has been renamed to _translate_word_legacy
in this PR), but only supports simple singular/plural form, so it does not work for languages with more complex plural forms. With this PR, _translate_word
should now work with most languages, using Unicode CLDR Plural Rules.
Lets say you want to pluralize some skill vocabulary (native language)
In this case, you should use get_plural_category
. You will get Unicode CLDR plural category and will then need to handle it according to your language's rules. You would probably have a dictionary of plural forms for all words and languages you need, and then just lookup it using category you get from get_plural_category
.
There is also get_plural_form
function, which is meant to get the correct plural form of any (native) word. However, because proper pluralization for arbitrary words is hard to do, it probably won't actually be implemented for any language for quite some time. So, you should probably use get_plural_category
and do pluralization yourself, only for words you actually need.
i get it, as a mycroft-core component, but LF?
If you meant why _translate_word
is in LF, it's because it's only an internal function used by other LF functions, so it wouldn't make sense it's in mycroft-core.
If you meant why get_plural_category
and get_plural_form
are in LF, it's because I think pluralization counts under "multilingual text parsing and formatting".
Hey there, I haven't looked at this deeply but definitely agree with the premise.
We're very close to launching the Mark II so keeping our focus quite specifically on that, but I'd like to come back to this once we hit code freeze on the Mark II.
I won't try to present an opinion on the questions raised until I've properly digested this. I would however say that I'm not opposed to more modules in LF beyond parse and format. Where the functions from this PR live is another question but very open to a util
module if that makes more sense for this.
I approve these changes and cherry picked these commits for ovos, so far i only did minor changes like adding an enum and porting the old singularize/pluralize prs for english/portuguese, but i still want to change a few more things like renaming "type" to something else to avoid the namespace collision with the builtin type etc.
might be interesting to keep an eye on https://github.com/OpenVoiceOS/ovos-lingua-franca/pull/28 , for what its worth i think this PR is looking great already!