keras-nlp
Arabic Tokenizer
Tokenization works differently for left-to-right languages such as English than for right-to-left languages such as Arabic. Adding new APIs that handle Arabic and other right-to-left languages would add a lot to KerasNLP and broaden its reach, since Arabic is the fifth most spoken language in the world and the fourth most used language on the internet. One solution is to integrate with the Farasa tokenizer API, which could also help with related problems such as lemmatization and diacritization (see the Farasa APIs).
Another approach is to use the SentencePiece project to train a new learned tokenizer.
Yeah, we have an issue open for a SentencePiece tokenizer: https://github.com/keras-team/keras-nlp/issues/27. I will get to that in the next week or two, hopefully!
Re https://farasa.qcri.org/ we could definitely support other approaches, but the trick is we would like tensorflow op level support for our tokenizers (see https://www.tensorflow.org/text/).
Are there key features that SentencePiece does not support? Or will that cover the main use cases?
Yes, I get it now. I was wondering how I could help with this project for right-to-left languages, but SentencePiece may be more than enough since it does not depend on the language. If there is a part of the project I can work on, please let me know so I can submit my proposal, and if there is something I can do before that, I will do my best to demonstrate that I am a hard worker.
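For anyone wondering why a language-independent tokenizer needs no special handling for right-to-left scripts, here is a minimal sketch using only the Python standard library. It is not SentencePiece or KerasNLP code. It only illustrates that Unicode stores Arabic text in logical (reading) order, with right-to-left layout applied at display time, so a tokenizer sees the characters in the same order a reader does. The sample sentence is an arbitrary choice made for illustration.

```python
import unicodedata

# Arabic text is stored in logical (reading) order; right-to-left layout
# is applied only when the text is rendered.
text = "اللغة العربية جميلة"  # sample sentence: "The Arabic language is beautiful"

# Whitespace splitting yields the words in reading order, as it would for English.
words = text.split()

# Bidirectional class of every non-space character ("AL" means Arabic Letter).
bidi_classes = {unicodedata.bidirectional(ch) for ch in text if not ch.isspace()}

print(len(words))
print(bidi_classes)
```

The first element of `words` is the first word a reader would read, even though it is drawn at the right edge of the line. Direction is a rendering concern, not a storage concern, which is consistent with the point above that a learned subword tokenizer does not depend on the language.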