Vaaku2Vec

[QUESTION] About the Tokenizer

Open · loretoparisi opened this issue 6 years ago · 2 comments

For a romanization project I'm working on, I'm using the polyglot-tokenizer, with good results for most of the Indian languages. Are you aware of it? My question is whether NLTK is better at tokenization. Thank you.
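For context, here is a minimal sketch of the comparison being asked about. The NLTK calls are its standard API; the polyglot-tokenizer calls (`Tokenizer(lang=...)`, `.tokenize()`) follow that package's README as I recall it, so treat them as an assumption to verify against the installed version:

```python
import nltk
from nltk.tokenize import word_tokenize   # pip install nltk
from polyglot_tokenizer import Tokenizer  # pip install polyglot-tokenizer

nltk.download("punkt")  # models required by word_tokenize

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്."  # "Malayalam is a Dravidian language."

# NLTK's general-purpose word tokenizer (not tuned for Indic scripts).
print(word_tokenize(text))

# polyglot-tokenizer, using the ISO 639-1 code for Malayalam.
tok = Tokenizer(lang="ml")
print(tok.tokenize(text))
```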

loretoparisi · Feb 05 '19

Thanks for the suggestion. I tried using it, and there are no significant improvements for Malayalam. Maybe it works well for the other languages mentioned. [screenshot attached, Feb 06 2019]

adamshamsudeen · Feb 06 '19

Thank you for testing it! I was aware of NLTK, but in the end I preferred polyglot-tokenizer because of its broader Indian-language support. Currently I'm looking into the Byte Pair Encoding (BPE) approach, in order to get rid of a language-specific model and build a cross-lingual one; I saw you are working on the same. In my case I also do Indian-language classification (the source is Wikipedia as well, for Indian-script languages), and most of the problems were actually due to the tokenizer rather than the classifier itself. Hopefully BPE will give better results for a language-agnostic approach!
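As a sketch of that direction, below is one way to train a single BPE model across the combined Wikipedia text, using SentencePiece as one concrete implementation. The thread does not name a library, and the corpus path and vocabulary size here are placeholders:

```python
import sentencepiece as spm  # pip install sentencepiece

# Train a BPE model directly on raw multilingual text; no
# language-specific tokenizer is needed beforehand.
spm.SentencePieceTrainer.train(
    input="wiki_indic_corpus.txt",  # hypothetical combined Wikipedia dump
    model_prefix="indic_bpe",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,  # keep full character coverage for Indic scripts
)

# Load the trained model and segment text from any of the languages
# with the same subword vocabulary.
sp = spm.SentencePieceProcessor(model_file="indic_bpe.model")
print(sp.encode("മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്.", out_type=str))
```

Because the subword vocabulary is learned from the data itself, the same model can segment every language in the corpus, which is what makes the approach attractive for a language-agnostic classifier.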

loretoparisi · Feb 06 '19