datatrove
How about adding custom word_tokenizers?
How about adding a custom word tokenizer class in `utils/word_tokenizers.py`?
The reasons are as follows:
- I don't want to be limited to the predetermined tokenizers (in `word_tokenizers.WORD_TOKENIZER_FACTORY`); I'd like to use another tokenizer, such as khaiii.
- Users of other languages could plug in their own tokenizer through a custom tokenizer class.
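To make the request concrete, here is a rough sketch of what I have in mind. Every name in it (`CustomWordTokenizer`, `CUSTOM_TOKENIZERS`, `register_tokenizer`, `load_custom_tokenizer`) is hypothetical and not part of datatrove's current API, and the whitespace tokenizer is only a stand-in for a real wrapper around something like khaiii.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class CustomWordTokenizer(ABC):
    """Hypothetical base class; an assumption, not datatrove's actual interface."""

    @abstractmethod
    def word_tokenize(self, text: str) -> List[str]:
        """Split text into a list of word tokens."""


class WhitespaceTokenizer(CustomWordTokenizer):
    """Stand-in for a wrapper around an external tokenizer such as khaiii."""

    def word_tokenize(self, text: str) -> List[str]:
        # str.split() with no arguments splits on any run of whitespace.
        return text.split()


# Hypothetical registry: user-chosen key -> factory that builds a tokenizer.
CUSTOM_TOKENIZERS: Dict[str, Callable[[], CustomWordTokenizer]] = {}


def register_tokenizer(name: str, factory: Callable[[], CustomWordTokenizer]) -> None:
    """Make a user-defined tokenizer available under the given name."""
    CUSTOM_TOKENIZERS[name] = factory


def load_custom_tokenizer(name: str) -> CustomWordTokenizer:
    """Build the tokenizer registered under the given name."""
    if name not in CUSTOM_TOKENIZERS:
        raise KeyError(f"No custom tokenizer registered under {name!r}")
    return CUSTOM_TOKENIZERS[name]()


register_tokenizer("whitespace", WhitespaceTokenizer)
tokens = load_custom_tokenizer("whitespace").word_tokenize("custom tokenizers are useful")
```

With something like this, a user could register their own class once and select it by name, without having to change the built-in factory.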