
How about adding custom word_tokenizers?

Open aiqwe opened this issue 7 months ago • 0 comments

How about adding a custom word tokenizer class in utils/word_tokenizers.py?

The reasons are as follows:

  • I want to use a tokenizer other than the predefined ones in word_tokenizers.WORD_TOKENIZER_FACTORY, for example khaiii.
  • Users of other languages could build their own tokenizers on top of a custom tokenizer class.
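To make the request concrete, here is a minimal sketch of what such a class could look like. Everything in it is hypothetical and not taken from datatrove: the names `CustomWordTokenizer`, `CUSTOM_TOKENIZER_FACTORY`, `register_tokenizer` and `load_custom_tokenizer` are only illustrations, and `WhitespaceTokenizer` is a stand-in for a wrapper around a real tokenizer such as khaiii.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class CustomWordTokenizer(ABC):
    """Hypothetical base class that user-defined tokenizers would subclass."""

    @abstractmethod
    def word_tokenize(self, text: str) -> List[str]:
        """Split text into a list of word tokens."""


class WhitespaceTokenizer(CustomWordTokenizer):
    """Stand-in implementation; a khaiii wrapper would go here instead."""

    def word_tokenize(self, text: str) -> List[str]:
        return text.split()


# Hypothetical registry kept separate from the built-in factory,
# mapping a user-chosen name to a zero-argument constructor.
CUSTOM_TOKENIZER_FACTORY: Dict[str, Callable[[], CustomWordTokenizer]] = {}


def register_tokenizer(name: str, factory: Callable[[], CustomWordTokenizer]) -> None:
    """Register a tokenizer under a name, refusing silent overwrites."""
    if name in CUSTOM_TOKENIZER_FACTORY:
        raise ValueError(f"tokenizer {name!r} is already registered")
    CUSTOM_TOKENIZER_FACTORY[name] = factory


def load_custom_tokenizer(name: str) -> CustomWordTokenizer:
    """Instantiate a previously registered tokenizer."""
    return CUSTOM_TOKENIZER_FACTORY[name]()


register_tokenizer("whitespace", WhitespaceTokenizer)
tokens = load_custom_tokenizer("whitespace").word_tokenize("custom tokenizers for datatrove")
print(tokens)
```

With something like this, a user could register their own tokenizer by name instead of being limited to the entries that ship with the library.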

aiqwe, Jul 17 '24 09:07