
Arabic Tokenizer

Open Abdelrahmanrezk opened this issue 3 years ago • 2 comments

Tokenization works differently for left-to-right languages such as English than for right-to-left languages such as Arabic. Adding new APIs that work with Arabic and other right-to-left languages would add a lot to KerasNLP and broaden its reach, since Arabic is the fifth most spoken language in the world and the fourth most used language on the internet. One solution is to integrate with the Farasa API tokenizer, as well as with its tools for related problems such as lemmatization and diacritization. Farasa APIs

Another approach is to use the SentencePiece project to train a new learned tokenizer.

SentencePiece project

Abdelrahmanrezk avatar Apr 08 '22 20:04 Abdelrahmanrezk

Yeah, we have an issue open for a sentence piece tokenizer https://github.com/keras-team/keras-nlp/issues/27. I will get to that in the next week or two hopefully!

Re https://farasa.qcri.org/ we could definitely support other approaches, but the trick is we would like tensorflow op level support for our tokenizers (see https://www.tensorflow.org/text/).

Are there key features that sentencepiece does not support, or will that cover the main use cases?

mattdangerw avatar Apr 11 '22 21:04 mattdangerw

Yes, I get it now. I was wondering how I could help with this project for right-to-left languages, but SentencePiece may be more than enough, since it does not depend on the language. If there is a part of the project I can work on, please let me know so that I can submit my proposal, and if there is something I can do before that, I will do my best to demonstrate that I am a hard worker.
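As a rough illustration of why a language-independent tokenizer can handle right-to-left text: Unicode strings are stored in logical (reading) order, and the right-to-left layout is applied only when the text is displayed. The sketch below uses only the Python standard library; the helper names `bidi_classes` and `whitespace_tokenize` are hypothetical, and the whitespace split is a stand-in for a real tokenizer, not SentencePiece itself.

```python
import unicodedata


def bidi_classes(text):
    """Return the set of Unicode bidirectional classes of the non-space characters."""
    return {unicodedata.bidirectional(ch) for ch in text if not ch.isspace()}


def whitespace_tokenize(text):
    """Toy tokenizer (illustrative stand-in): split on whitespace in stored order."""
    return text.split()


arabic = "اللغة العربية جميلة"
english = "Arabic is beautiful"

# Arabic letters carry the bidi class "AL" (Arabic Letter), Latin letters "L".
arabic_classes = bidi_classes(arabic)
english_classes = bidi_classes(english)

# The same code tokenizes both strings; the first token is the first word
# in reading order, regardless of the direction in which it is rendered.
arabic_tokens = whitespace_tokenize(arabic)
english_tokens = whitespace_tokenize(english)

print(arabic_classes, english_classes)
print(arabic_tokens)
print(english_tokens)
```

The direction shows up only as a per-character display property, so a tokenizer that operates on the stored sequence of code points needs no special handling for it. Arabic morphology (clitics, diacritics) is a separate concern that tools such as Farasa target.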

Abdelrahmanrezk avatar Apr 12 '22 01:04 Abdelrahmanrezk