german_transliterate
Conversion that takes grammatical information into account, such as the four cases, number, and gender?
python core.py 'seit 21.6.2012'
seit einundzwanzigste juni zweitausendzwölf
But it should be as follows: seit einundzwanzigstem Juni zweitausendzwölf
Or is this distinction not important to native speakers?
Hi @kkokdari,
thanks for getting in touch. You're right, from a pure grammar standpoint "seit einundzwanzigstem Juni zweitausendzwölf" is the correct version in German.
The reason the conversion gets the declension of "einundzwanzigste" wrong is that the ending depends on the case (in German: Nominativ, Genitiv, Dativ, Akkusativ), and determining the case requires more than regular expressions or patterns found in the text. I have some heuristics that try to find a trade-off, but of course they also get it wrong many times.
For example, if you replace the word "seit" ("since" in English) with "am" (roughly "on the" in English), the declension would be "einundzwanzigsten" (ending in "n"), and so forth...
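To make the dependency on the preceding word concrete, here is a minimal sketch of a lookup from a trigger word to the ordinal ending. This is not how german_transliterate works internally; the table `ORDINAL_ENDINGS` and the function `decline_ordinal` are hypothetical names made up for illustration, and the table only covers the two examples from this thread.

```python
# Hypothetical lookup: preceding word -> ending appended to the ordinal stem.
# Illustrative only; these entries are not taken from german_transliterate.
ORDINAL_ENDINGS = {
    "seit": "m",  # "seit einundzwanzigstem Juni"
    "am": "n",    # "am einundzwanzigsten Juni"
}


def decline_ordinal(previous_word: str, ordinal_stem: str) -> str:
    """Append the ending implied by the preceding word.

    Falls back to the unchanged stem when the word is not in the table.
    """
    ending = ORDINAL_ENDINGS.get(previous_word.lower(), "")
    return ordinal_stem + ending


print(decline_ordinal("seit", "einundzwanzigste"))
print(decline_ordinal("am", "einundzwanzigste"))
print(decline_ordinal("der", "einundzwanzigste"))
```

A table like this breaks down quickly: in "seit dem 21.6.2012" the word directly before the date is "dem", which is not in the table, so the sketch would return the bare stem even though the correct form is "einundzwanzigsten".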
To get the proper case/declension here, it would require an understanding of the text or a short list of fixed phrases (e.g. like "seit
Only semantic (so, nowadays, machine-learning-based) handling could help here, I suppose. But even that wouldn't get it right in a lot of cases. One example is the highly regarded framework spaCy (https://spacy.io/), which has a parser for German. However, for more complex (i.e. regular :-) sentences and texts it has a high error rate, in my experience...
Conclusion: This is too advanced to really make a difference with the current setup of german_transliterate. However, if you would like to include an ML-based solution, let me know and maybe we can work out an acceptable solution (taking into account both performance and correctness).