RWKV-World-HF-Tokenizer issues

Results: 4 issues


how to create tokenizer.json

I need to use **tokenizer.json** in my project. How should I create it?

Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from huggingface?

Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from https://huggingface.co/RWKV/v5-Eagle-7B-HF/tree/main? I saw that WordpieceTokenizer from tokenization_rwkv5.py uses whitespace_tokenize to split tokens, which seems to be unfriendly to Chinese characters.

shiroko98
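The concern about whitespace splitting in the question above can be illustrated with a minimal sketch. The `whitespace_tokenize` function below is a local stand-in written for this example, assuming the helper behaves like the common strip-and-split implementation; it is not copied from tokenization_rwkv5.py, and it covers only the pre-splitting step, not any sub-word matching that may follow.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Stand-in helper (assumption): strip the text, then split on whitespace."""
    text = text.strip()
    if not text:
        return []
    return text.split()


# English words are separated by spaces, so each word becomes its own piece.
english_pieces = whitespace_tokenize("Hello world")
print(english_pieces)  # ['Hello', 'world']

# Chinese text normally has no spaces, so the whole string stays in one piece
# and any later per-piece step has to handle the entire sentence at once.
chinese_pieces = whitespace_tokenize("你好世界")
print(chinese_pieces)  # ['你好世界']
```

Whether this actually changes the final token IDs depends on what the tokenizer does with each piece afterwards, which this sketch does not model.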

Update convert_rwkv6_checkpoint_to_hf.py

Updated the 14B hidden size to 4096.

PicoCreator

The tokenizer is 2.5x slower than other Hugging Face tokenizers and the original Blink's world tokenizer

This tokenizer is 2.5x slower than other Hugging Face tokenizers and the original Blink's world tokenizer. The comparison can be tested here: https://colab.research.google.com/gist/cahya-wirawan/932f95ece55c838e186dc3b1c9fcbef4/rwkv-tokenizers.ipynb It also generates different token IDs for the following...

cahya-wirawan
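A comparison like the one described in the report above, covering both speed and token IDs, could be sketched roughly as follows. This is not the linked notebook: `time_tokenizer` and `first_mismatch` are hypothetical helpers, and the two encoders are simple stand-ins (code points vs. UTF-8 bytes) so the example runs without any model files. In practice you would pass the `encode` callables of the tokenizers being compared.

```python
import time
from typing import Callable, List, Optional, Sequence

Encoder = Callable[[str], List[int]]


def time_tokenizer(encode: Encoder, texts: Sequence[str], repeats: int = 3) -> float:
    """Return the best wall-clock time (seconds) to encode all texts once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        best = min(best, time.perf_counter() - start)
    return best


def first_mismatch(encode_a: Encoder, encode_b: Encoder, texts: Sequence[str]) -> Optional[str]:
    """Return the first text whose token IDs differ between the two encoders."""
    for text in texts:
        if list(encode_a(text)) != list(encode_b(text)):
            return text
    return None


# Stand-in encoders (assumptions for illustration only, not real tokenizers).
def encode_by_codepoint(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def encode_by_utf8_byte(text: str) -> List[int]:
    return list(text.encode("utf-8"))


samples = ["Hello world", "你好世界"]
timings = {
    "codepoint": time_tokenizer(encode_by_codepoint, samples),
    "utf8_byte": time_tokenizer(encode_by_utf8_byte, samples),
}
mismatch = first_mismatch(encode_by_codepoint, encode_by_utf8_byte, samples)
print(timings)
print("first differing input:", mismatch)
```

Taking the best of several repeats reduces noise from one-off effects such as caching; a real benchmark would also use a much larger and more varied text sample.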
