keras-nlp
Adding an example that uses a ByteLevelTokenizer instead of implementing a language-specific tokenizer
Instead of implementing language-specific tokenizers, we could look at a more robust way of dealing with this problem: using a ByteTokenizer rather than creating a separate tokenizer for each language. For this purpose, we could add an example showing how a ByteTokenizer can be used across different languages, similar to how HuggingFace does it in their blog post where they train EsperBERTo. This is just an alternative that might reduce the amount of work needed, and I have not tested this claim yet; it may turn out that a language-specific tokenizer works better than a byte-level one, but I suspect this could be a good solution.
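To make the idea concrete, here is a minimal sketch of what such an example might look like, using `keras_nlp.tokenizers.ByteTokenizer` from the library; the sample sentences and the `sequence_length` value are purely illustrative:

```python
import keras_nlp

# A byte-level tokenizer needs no language-specific vocabulary:
# every string is split into its raw UTF-8 bytes (vocab size 256).
tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=16)

# The same tokenizer handles any language without retraining.
samples = [
    "Saluton, mondo!",    # Esperanto
    "Bonjour le monde!",  # French
    "こんにちは世界",        # Japanese
]
token_ids = tokenizer(samples)  # dense tensor of byte ids, shape (3, 16)
print(token_ids)

# Byte ids can be mapped back to text for inspection.
print(tokenizer.detokenize(token_ids))
```

An end-to-end example could then feed these byte ids into a small model, mirroring the EsperBERTo walkthrough but without training a language-specific vocabulary first.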
I was mainly inspired by a similar issue I came across for HuggingFace while looking around, and thought that a comparable example for keras-nlp would be quite beneficial.
Would love to know your views on this!