RWKV-World-HF-Tokenizer
I need to use **tokenizer.json** in my project. How should I create it?
Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from https://huggingface.co/RWKV/v5-Eagle-7B-HF/tree/main? I saw that WordpieceTokenizer in tokenization_rwkv5.py uses whitespace_tokenize to split tokens, which seems unfriendly to Chinese text, since Chinese is not written with spaces between words.
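To illustrate the concern, here is a minimal sketch of whitespace-based splitting. The `whitespace_tokenize` below is my own stand-in, written on the assumption that the helper in tokenization_rwkv5.py simply strips the text and splits on whitespace; I have not copied it from the repository, so the real function may differ.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace.

    Assumption: this mirrors what the helper in tokenization_rwkv5.py does;
    it is a stand-in, not the repository's code.
    """
    text = text.strip()
    if not text:
        return []
    return text.split()


# English words are separated by spaces, so each word becomes its own piece.
english = whitespace_tokenize("RWKV is a language model")

# Chinese has no spaces between words, so the whole sentence stays one piece.
chinese = whitespace_tokenize("我喜欢自然语言处理")

print(english)
print(chinese)
```

If the pre-split hands the whole Chinese sentence to the next stage as a single piece, the quality of the result depends entirely on how that stage handles long unsegmented input.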
Updated 14B Hidden size to 4096
This tokenizer is 2.5x slower than other Hugging Face tokenizers and the original blinks world tokenizer. The comparison can be reproduced here: https://colab.research.google.com/gist/cahya-wirawan/932f95ece55c838e186dc3b1c9fcbef4/rwkv-tokenizers.ipynb It also generates different token ids for the following...
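For anyone who wants to check both points locally, here is a rough sketch of a comparison harness. It is not taken from the linked notebook. `first_mismatch`, `time_encode`, and the two toy encoders are hypothetical names used only for illustration; swap in the `encode` callables of the tokenizers you actually want to compare.

```python
import time


def first_mismatch(ids_a, ids_b):
    """Return the first index where two id sequences differ, or None if equal."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        # One sequence is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None


def time_encode(encode, texts, repeats=3):
    """Return the best wall-clock time (seconds) to encode all texts."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        best = min(best, time.perf_counter() - start)
    return best


# Toy stand-in encoders (hypothetical); replace with real tokenizer.encode calls.
def encode_bytes(text):
    return list(text.encode("utf-8"))


def encode_codepoints(text):
    return [ord(ch) for ch in text]


texts = ["hello world", "café", "你好"]
for text in texts:
    idx = first_mismatch(encode_bytes(text), encode_codepoints(text))
    print(repr(text), "first differing position:", idx)

print("bytes encoder:", time_encode(encode_bytes, texts))
print("codepoint encoder:", time_encode(encode_codepoints, texts))
```

Taking the best of several repeats reduces noise from warm-up, and reporting the first differing position makes it easier to find the input that triggers an id mismatch.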