indic_nlp_library
Transliteration is incorrect for a few characters in Tamil
The code below transliterates Tamil text to ITRANS and back again. The round-tripped output does not match the input.
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

lang = 'ta'
input_text = u'ஒன்றுமட்டுமல்லாது'

# Tamil -> ITRANS
itrans_text = ItransTransliterator.to_itrans(input_text, lang)
print(itrans_text)
# OUTPUT: .oऩRumaTTumallAtu

# ITRANS -> Tamil (round trip)
x = ItransTransliterator.from_itrans(itrans_text, lang)
print(x)
# OUTPUT: ஒனறுமட்டுமல்லாது
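To make the mismatch easier to see, here is a minimal sketch that compares the original input with the reported round-trip output code point by code point. It uses only the standard library and the two strings quoted above; the helper names `describe` and `codepoint_diff` are made up for illustration and are not part of indic_nlp_library.

```python
import unicodedata
from difflib import SequenceMatcher

# Strings taken from the report, written as escapes so the code points are explicit.
# original:      ஒன்றுமட்டுமல்லாது
original = ("\u0b92\u0ba9\u0bcd\u0bb1\u0bc1\u0bae\u0b9f\u0bcd\u0b9f"
            "\u0bc1\u0bae\u0bb2\u0bcd\u0bb2\u0bbe\u0ba4\u0bc1")
# round_tripped: ஒனறுமட்டுமல்லாது (reported output of from_itrans)
round_tripped = ("\u0b92\u0ba9\u0bb1\u0bc1\u0bae\u0b9f\u0bcd\u0b9f"
                 "\u0bc1\u0bae\u0bb2\u0bcd\u0bb2\u0bbe\u0ba4\u0bc1")


def describe(text):
    """Return 'U+XXXX NAME' for every code point in text."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]


def codepoint_diff(a, b):
    """List the non-matching regions between two strings (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, describe(a[i1:i2]), describe(b[j1:j2])))
    return changes


for change in codepoint_diff(original, round_tripped):
    print(change)
```

For these two strings the only difference is a single deleted U+0BCD (TAMIL SIGN VIRAMA, the pulli) after ன, so the loss seems to happen on that one consonant cluster rather than across the whole word.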
Thanks for pointing this out. The extended ITRANS standard we defined probably does not have a mapping for this character. I will check this over the weekend.
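A quick way to spot characters that may lack a mapping, assuming the ITRANS output is meant to be plain ASCII, is to list whatever non-ASCII code points survive in it. The sketch below runs on the output string reported above; `non_ascii_characters` is a hypothetical helper, not a library function.

```python
import unicodedata


def non_ascii_characters(itrans_text):
    """Return (index, code point, Unicode name) for each non-ASCII character.

    Assumption: a complete ITRANS mapping yields ASCII only, so anything
    listed here is a candidate for a missing mapping.
    """
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(itrans_text)
        if ord(ch) > 127
    ]


# Output of to_itrans as reported in this issue: .oऩRumaTTumallAtu
reported = ".o\u0929RumaTTumallAtu"
print(non_ascii_characters(reported))
```

On the reported string this flags one character, U+0929 (DEVANAGARI LETTER NNNA), at index 2, which is where the Tamil ன appeared in the input.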
I wonder how this transliteration compares to the open-tamil package.
Anoop, will you be publishing this package on the Python package repository? Also, where are the unit tests for this project? I can't seem to find them.
The open-tamil package also has some problems handling Unicode. You have to type the text out explicitly in Tamil to get the best results. The discrepancy I faced looks like this:
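from tamil import utf8  # open-tamil

unicode("தொ","utf-8")
#OUTPUT : u'\u0ba4\u0bc6\u0bbe'   (vowel signs E + AA, decomposed)
tamil_letter = utf8.get_letters("தொ")
utf_tamil = ''.join(tamil_letter).decode("utf-8")
#OUTPUT : u'\u0ba4\u0bca'   (single vowel sign O, precomposed)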
I used the open-tamil package. In the two cases the letters came from different sources, i.e. different texts.
@vrindaprabhu - please create a suitable issue and we can address it. Also http://libindic.org/ has interesting code bits.
@vrindaprabhu - I checked on Python 3 with Open-Tamil version 0.51 and I'm not seeing the issue you report. get_letters() returns a list with just one letter.
Strange. As I mentioned, it probably depends on how "தொ" is written. I did not hit the issue every time either, only with a few particular sentences in the corpus.
@vrindaprabhu - these were Unicode normalization issues, and they are fixed in version 0.65.
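For anyone on an older version, the two spellings of "தொ" quoted earlier in this thread are canonically equivalent in Unicode, so normalizing both sides before comparing should make them match. This is a minimal Python 3 sketch using the standard `unicodedata` module; `same_text` is a hypothetical helper and says nothing about how open-tamil fixed it internally.

```python
import unicodedata

# The two forms reported in this thread.
decomposed = "\u0ba4\u0bc6\u0bbe"   # TA + VOWEL SIGN E + VOWEL SIGN AA
precomposed = "\u0ba4\u0bca"        # TA + VOWEL SIGN O


def same_text(a, b, form="NFC"):
    """Compare two strings after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(decomposed == precomposed)            # raw comparison of code points
print(same_text(decomposed, precomposed))   # comparison after NFC normalization
```

The raw comparison is False while the normalized one is True. Normalizing corpus text once on input (NFC here) keeps later steps such as letter splitting from seeing two different encodings of the same letter.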