
support additional special tokens

Open Derekglk opened this issue 2 years ago • 1 comments

Hello @wangkuiyi ,

It seems this tokenizer only supports one special token, "<|endoftext|>". Does it support other additional special tokens, for instance the ones we added in special_tokens_map.json, like "<|user|>", "<|assistant|>", "<s>", "</s>", and "<unk>"?
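For reference, a special_tokens_map.json that declares extra special tokens alongside the standard ones typically looks like this (the exact token strings here are just the examples from above):

```json
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "additional_special_tokens": ["<|user|>", "<|assistant|>"]
}
```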

Thanks!

Derekglk avatar Nov 16 '23 07:11 Derekglk

Hi there~ I would also like to ask about adding special tokens. For some models, such as Qwen1.5, the special tokens are not in vocab.json or merges.txt at first; they seem to be added later in the Hugging Face Rust tokenizer implementation. Does this repo also support this feature of adding special tokens? Thank you.
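To make the request concrete: the usual way tokenizers handle added special tokens that are absent from vocab.json is to first split the input on exact matches of the special-token strings, emit those segments as reserved IDs directly, and only run BPE on the remaining plain-text segments. Below is a minimal C++ sketch of just the splitting step; `Segment` and `SplitOnSpecialTokens` are hypothetical names for illustration, not part of this repo's API.

```cpp
#include <string>
#include <vector>

// A segment of input text: either a literal special token (is_special ==
// true) that should map directly to its reserved ID, or ordinary text that
// would still need to go through BPE encoding.
struct Segment {
  std::string text;
  bool is_special;
};

// Greedily split `input` on any of the given special-token strings,
// scanning left to right and preferring the earliest (then longest) match.
std::vector<Segment> SplitOnSpecialTokens(
    const std::string& input, const std::vector<std::string>& specials) {
  std::vector<Segment> out;
  size_t pos = 0;
  while (pos < input.size()) {
    size_t best_at = std::string::npos;
    size_t best_len = 0;
    for (const auto& tok : specials) {
      size_t at = input.find(tok, pos);
      if (at != std::string::npos &&
          (at < best_at || (at == best_at && tok.size() > best_len))) {
        best_at = at;
        best_len = tok.size();
      }
    }
    if (best_at == std::string::npos) {
      // No special token remains; the tail is ordinary text.
      out.push_back({input.substr(pos), false});
      break;
    }
    if (best_at > pos) {
      out.push_back({input.substr(pos, best_at - pos), false});
    }
    out.push_back({input.substr(best_at, best_len), true});
    pos = best_at + best_len;
  }
  return out;
}
```

A tokenizer extended this way would then look up `is_special` segments in an added-tokens table and BPE-encode the rest, which matches how the Rust implementation layers added tokens on top of the base vocabulary.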

PengWenChen avatar May 03 '24 06:05 PengWenChen