tokenizers

How to write a custom WordPiece class?

Open xinyinan9527 opened this issue 9 months ago • 2 comments

My aim is to get the rwkv5 model's "tokenizer.json", but the model is implemented with a slow tokenizer (class PreTrainedTokenizer). I want to convert the slow tokenizer to a fast one, which requires "tokenizer = Tokenizer(WordPiece())", but rwkv5 has its own WordPiece vocabulary file. So I want to create a custom WordPiece model.

The code is here:

    from tokenizers.models import Model

    class MyWordpiece(Model):
        def __init__(self, vocab, unk_token):
            self.vocab = vocab
            self.unk_token = unk_token

    test = MyWordpiece('./vocab.txt', "<s>")

Traceback (most recent call last):
  File "test.py", line 78, in <module>
    test = MyWordpiece('./vocab.txt',"<s>")
TypeError: Model.__new__() takes 0 positional arguments but 2 were given

xinyinan9527 avatar May 09 '24 03:05 xinyinan9527
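For context, the TypeError above occurs because `tokenizers.models.Model` is implemented in Rust and cannot be subclassed or instantiated from Python. A minimal sketch of the supported route is to instantiate the built-in `WordPiece` model directly; the tiny vocab here is a placeholder for illustration, not rwkv5's real vocabulary (with a real file, `WordPiece.from_file("./vocab.txt", unk_token="<s>")` would do the loading):

```python
# Sketch: build a fast tokenizer around the built-in WordPiece model
# instead of subclassing the Rust-backed Model base class.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Placeholder vocab for illustration only; rwkv5's actual vocabulary
# would be loaded with WordPiece.from_file("./vocab.txt", unk_token="<s>").
vocab = {"<s>": 0, "hello": 1, "##world": 2}
tokenizer = Tokenizer(WordPiece(vocab, unk_token="<s>"))

print(tokenizer.encode("helloworld").tokens)  # ['hello', '##world']
```

Calling `tokenizer.save("tokenizer.json")` on the result would then produce the fast-tokenizer file the question asks about, assuming the vocabulary really is WordPiece-compatible.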

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jun 09 '24 01:06 github-actions[bot]

Hey! That is not really the way to do it! Are you still interested in having the fast version?

ArthurZucker avatar Jun 11 '24 13:06 ArthurZucker
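For reference, the usual slow-to-fast path in transformers is `convert_slow_tokenizer`, which only works for architectures with a registered converter; rwkv5 has no registered converter, so it would need a custom one. The sketch below uses `BertTokenizer` purely to illustrate the mechanics, with a throwaway vocab file:

```python
# Hedged sketch of the generic slow -> fast conversion path in transformers.
# BertTokenizer stands in here only to demonstrate the mechanics; rwkv5 is
# not covered by the registered converters.
import os
import tempfile
from transformers import BertTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

with tempfile.TemporaryDirectory() as d:
    vocab_path = os.path.join(d, "vocab.txt")
    with open(vocab_path, "w") as f:
        f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "##world"]))

    slow = BertTokenizer(vocab_path)       # a slow PreTrainedTokenizer
    fast = convert_slow_tokenizer(slow)    # returns a tokenizers.Tokenizer

    # Serialize the fast tokenizer -- this is the tokenizer.json file
    # the original question is after.
    fast.save(os.path.join(d, "tokenizer.json"))
```

For rwkv5 specifically, one would have to write a converter (or hand-build a `Tokenizer` from its vocabulary) rather than subclass `Model`, since its tokenizer does not follow a standard WordPiece/BPE layout.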

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 13 '24 01:07 github-actions[bot]