Vaaku2Vec

[QUESTION] About the Tokenizer

Open · loretoparisi opened this issue 6 years ago · 2 comments

For a romanization project I'm working on, I'm using the polyglot-tokenizer, with good results for most of the Indian languages. Are you aware of it? My question is whether NLTK is better at tokenization. Thank you.
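For context, here is a minimal sketch of the comparison being asked about. The NLTK calls are its standard API; the polyglot-tokenizer calls (`Tokenizer(lang=...)`, `.tokenize()`) follow that package's README as I recall it, so treat them as an assumption to verify against the installed version:

```python
import nltk
from nltk.tokenize import word_tokenize   # pip install nltk
from polyglot_tokenizer import Tokenizer  # pip install polyglot-tokenizer

nltk.download("punkt")  # models required by word_tokenize

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്."  # "Malayalam is a Dravidian language."

# NLTK's general-purpose word tokenizer (not tuned for Indic scripts).
print(word_tokenize(text))

# polyglot-tokenizer, using the ISO 639-1 code for Malayalam.
tok = Tokenizer(lang="ml")
print(tok.tokenize(text))
```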

loretoparisi · Feb 05 '19

Thanks for the suggestion. I tried using it, and there are no significant improvements for Malayalam. Maybe it works well for the other languages mentioned. [screenshot attached, Feb 06 2019]

adamshamsudeen · Feb 06 '19

Thank you for testing it! I was aware of NLTK, but in the end I preferred polyglot-tokenizer because of its broader Indian-language support. Currently I'm looking into the Byte Pair Encoding (BPE) approach, in order to get rid of a language-specific model and build a cross-lingual one; I saw you are working on the same. In my case I also do Indian-language classification (the source is Wikipedia as well, for Indian-script languages), and most of the problems were actually due to the tokenizer rather than the classifier itself. Hopefully BPE will give better results for a language-agnostic approach!
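As a sketch of that direction, below is one way to train a single BPE model across the combined Wikipedia text, using SentencePiece as one concrete implementation. The thread does not name a library, and the corpus path and vocabulary size here are placeholders:

```python
import sentencepiece as spm  # pip install sentencepiece

# Train a BPE model directly on raw multilingual text; no
# language-specific tokenizer is needed beforehand.
spm.SentencePieceTrainer.train(
    input="wiki_indic_corpus.txt",  # hypothetical combined Wikipedia dump
    model_prefix="indic_bpe",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,  # keep full character coverage for Indic scripts
)

# Load the trained model and segment text from any of the languages
# with the same subword vocabulary.
sp = spm.SentencePieceProcessor(model_file="indic_bpe.model")
print(sp.encode("മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്.", out_type=str))
```

Because the subword vocabulary is learned from the data itself, the same model can segment every language in the corpus, which is what makes the approach attractive for a language-agnostic classifier.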

loretoparisi · Feb 06 '19