german_transliterate conversion with information like four cases, numerus, gender?

Open kkokdari opened this issue 2 years ago • 1 comments

python core.py 'seit 21.6.2012'

seit einundzwanzigste juni zweitausendzwölf

but it should be as follows: seit einundzwanzigstem Juni zweitausendzwölf

Or is this conversion not important for native speakers?

Aug 31 '21 03:08 kkokdari

Hi @kkokdari ,

thanks for getting in touch. You're right, from a pure grammar standpoint "seit einundzwanzigstem Juni zweitausendzwölf" is the correct version in German.

The conversion gets the declension of "einundzwanzigste" wrong because the ending depends on the grammatical case (in German: Nominativ, Genitiv, Dativ, Akkusativ), and determining the case requires more than regular expressions or patterns found in the text. I have some heuristics that try to find a trade-off, but of course they also get it wrong many times.

For example, if you replace the word "seit" ("since" in English) with "am" (roughly "on the" in English), the declension would be "einundzwanzigsten" (ending in "-n"), and so forth...

To get the proper case/declension here, it would require either a textual understanding or a short list of fixed phrases (e.g. "seit " or "am "). The latter would still be somewhat error-prone, I am afraid.
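To illustrate the fixed-phrase idea, here is a minimal sketch in Python. It is not part of german_transliterate: the table `ORDINAL_ENDINGS` and the function `decline_ordinal` are hypothetical names, and only the "seit" and "am" entries come from this discussion ("vom" and "zum" are added purely for illustration). It assumes the ordinal stem and the word directly before the date are already known.

```python
# Hypothetical lookup table: word directly before the date -> ordinal ending.
# Only "seit" and "am" are taken from the discussion; "vom" and "zum" are
# illustrative additions, not an exhaustive or authoritative list.
ORDINAL_ENDINGS = {
    "seit": "em",  # dative without article: "seit einundzwanzigstem Juni"
    "am": "en",    # "an dem": "am einundzwanzigsten Juni"
    "vom": "en",   # "von dem": "vom einundzwanzigsten Juni"
    "zum": "en",   # "zu dem": "zum einundzwanzigsten Juni"
}

# Fallback when the preceding word is not in the table (assumed default).
DEFAULT_ENDING = "e"


def decline_ordinal(preceding_word: str, ordinal_stem: str) -> str:
    """Append an ending to an ordinal stem based on the preceding word.

    `ordinal_stem` is the ordinal without its ending, e.g. "einundzwanzigst".
    """
    ending = ORDINAL_ENDINGS.get(preceding_word.lower(), DEFAULT_ENDING)
    return ordinal_stem + ending


if __name__ == "__main__":
    for word in ("seit", "am", "der"):
        print(word, decline_ordinal(word, "einundzwanzigst"))
```

A table like this only covers the phrases listed in it, and anything else silently falls back to the default ending, which is exactly why this approach stays error-prone.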

Only a semantic (thus, nowadays, machine learning-based) approach could help here, I suppose. But even that would get many cases wrong. An example is the highly regarded framework spaCy (https://spacy.io/), which has a parser for German. However, for more complex (i.e. regular :-) sentences and texts it has a high error rate, in my experience...

Conclusion: This is too advanced to really make a difference with the current setup of german_transliterate. However, if you would like to include an ML-based solution, let me know and maybe we can work out an acceptable solution (taking into account both performance and correctness).

Sep 03 '21 09:09 repodiac
