tokenizers
How to write custom Wordpiece class?
My aim is to get the rwkv5 model's "tokenizer.json", but the model is implemented with a slow tokenizer (a PreTrainedTokenizer subclass). I want to convert the slow tokenizer to a fast one, which requires `tokenizer = Tokenizer(WordPiece())`, but rwkv5 has its own WordPiece vocab file. So I want to create a custom WordPiece model.
the code is here
from tokenizers.models import Model

class MyWordpiece(Model):
    def __init__(self, vocab, unk_token):
        self.vocab = vocab
        self.unk_token = unk_token

test = MyWordpiece('./vocab.txt', "<s>")
Traceback (most recent call last):
File "test.py", line 78, in <module>
test = MyWordpiece('./vocab.txt',"<s>")
TypeError: Model.__new__() takes 0 positional arguments but 2 were given
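The error occurs because the models in `tokenizers` are Rust-backed and cannot be subclassed from Python, so `Model.__new__()` rejects the constructor arguments. A common workaround is to parse the custom vocab file yourself and feed the resulting dict to the built-in `WordPiece` model. A minimal sketch, assuming the vocab is one token per line (`load_vocab` is a hypothetical helper; rwkv5's real vocab format may differ):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Hypothetical helper: map each line of a plain one-token-per-line
# vocab file to its index, producing the dict WordPiece expects.
def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

# Tiny stand-in vocab written to disk so the sketch is self-contained.
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("<s>\nhello\nworld\n##ing\n")

vocab = load_vocab("vocab.txt")
tokenizer = Tokenizer(WordPiece(vocab, unk_token="<s>"))
tokenizer.save("tokenizer.json")
```

The resulting `Tokenizer` (or the saved "tokenizer.json") can then be wrapped with transformers' `PreTrainedTokenizerFast(tokenizer_object=tokenizer)` to get a fast tokenizer, provided the custom tokenization logic really is expressible as WordPiece.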
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hey! That is not really the way to do it! Are you still interested in having the fast version?