indic_nlp_library

Transliteration incorrect for a few Tamil characters

Open vrindaprabhu opened this issue 8 years ago • 7 comments

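Please find below the code I used to transliterate from Tamil to English (ITRANS) and back. The ITRANS output contains a Devanagari character (ऩ), and the round trip does not reproduce the input: ன்று comes back as னறு.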

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text = u'ஒன்றுமட்டுமல்லாது'
lang = 'ta'

# Tamil -> ITRANS
itrans_text = ItransTransliterator.to_itrans(input_text, lang)
print(itrans_text)
# OUTPUT: .oऩRumaTTumallAtu

# ITRANS -> Tamil (round trip)
x = ItransTransliterator.from_itrans(itrans_text, lang)
print(x)
# OUTPUT: ஒனறுமட்டுமல்லாது

vrindaprabhu avatar Oct 06 '16 12:10 vrindaprabhu
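
To pin down exactly what the round trip above loses, the two strings can be compared code point by code point. This is a minimal sketch that assumes Python 3 and uses only the standard library (difflib and unicodedata), not indic_nlp_library itself; the helper name dropped_code_points is hypothetical, and the two strings are copied from the report.

```python
import difflib
import unicodedata

# Strings copied from the report: the input and the round-tripped output.
original = "ஒன்றுமட்டுமல்லாது"
round_trip = "ஒனறுமட்டுமல்லாது"


def dropped_code_points(before, after):
    """Return (index, char, Unicode name) for code points of `before`
    that have no counterpart in `after`."""
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            for i in range(i1, i2):
                dropped.append((i, before[i], unicodedata.name(before[i])))
    return dropped


for index, char, name in dropped_code_points(original, round_trip):
    print(index, "U+%04X" % ord(char), name)
```

With the strings as reported, the only difference should be the pulli (virama) that follows ன.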

Thanks for pointing this out. The extended ITRANS standard we defined probably does not have a mapping for this character. I will check this over the weekend.

anoopkunchukuttan avatar Oct 06 '16 12:10 anoopkunchukuttan

I wonder how this transliteration compares to the open-tamil package.

Anoop, will you be publishing this package on the Python Package Index (PyPI)? Also, where are the unit tests for this project? I can't seem to find them.

arcturusannamalai avatar Oct 22 '16 01:10 arcturusannamalai

The open-tamil package also has some problems handling Unicode. You have to type the text out in Tamil explicitly to get the best results. The discrepancy I ran into looks like this:

# Python 2
from tamil import utf8

unicode("தொ", "utf-8")
# OUTPUT: u'\u0ba4\u0bc6\u0bbe'

tamil_letter = utf8.get_letters("தொ")
utf_tamil = ''.join(tamil_letter).decode("utf-8")
# OUTPUT: u'\u0ba4\u0bca'

I used the open-tamil package. In the two cases the letters came from different sources, i.e. different texts.

vrindaprabhu avatar Oct 28 '16 14:10 vrindaprabhu

@vrindaprabhu - please create a suitable issue and we can address it. Also http://libindic.org/ has interesting code bits.

arcturusannamalai avatar Oct 29 '16 03:10 arcturusannamalai

@vrindaprabhu - I checked with Python 3 and Open-Tamil version 0.51, and I'm not seeing the issue you report. get_letters() returns a list containing just one letter.

arcturusannamalai avatar Oct 30 '16 00:10 arcturusannamalai

Strange. As I mentioned, it probably depends on how "தொ" is written. I did not hit the issue every time either, only with a few particular sentences in the corpus.

vrindaprabhu avatar Nov 02 '16 10:11 vrindaprabhu

@vrindaprabhu - there are Unicode normalization issues, and these are fixed in version 0.65.

arcturusannamalai avatar Nov 03 '16 05:11 arcturusannamalai
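
The two outputs reported earlier in the thread (u'\u0ba4\u0bc6\u0bbe' and u'\u0ba4\u0bca') are canonically equivalent Unicode spellings of தொ. A minimal sketch, assuming Python 3 and only the standard-library unicodedata module, shows how NFC normalization makes them compare equal. It illustrates the general technique and is not necessarily how open-tamil 0.65 handles it; the helper name nfc is hypothetical.

```python
import unicodedata

# The two spellings of "தொ" reported in the thread.
decomposed = "\u0ba4\u0bc6\u0bbe"  # TA + VOWEL SIGN E + VOWEL SIGN AA
precomposed = "\u0ba4\u0bca"       # TA + VOWEL SIGN O


def nfc(text):
    """Normalize text to Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# The raw strings differ even though they render identically.
print(decomposed == precomposed)

# After normalization both are the same two-code-point sequence.
print(nfc(decomposed) == nfc(precomposed))
print([unicodedata.name(ch) for ch in nfc(decomposed)])
```

Normalizing both the corpus text and any typed-in comparison strings to the same form before processing avoids mismatches that depend on how the text was entered.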