huggingface-tokenizer-in-cxx
support additional special tokens
Hello @wangkuiyi ,
It seems this tokenizer only supports one special token "<|endoftext|>".
Does it support other additional special tokens? For instance, the ones we added in special_tokens_map.json,
like
"<|user|>", "<|assistant|>", "<s>", "</s>" and "<unk>"?
Thanks!
Hi there~ I would also like to ask about adding special tokens. For some models, such as Qwen1.5, the special tokens are not in vocab.json or merges.txt at first; they seem to be added later in the Hugging Face Rust tokenizer implementation. Does this repo also support this feature of adding special tokens? Thank you.
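For context, the usual way tokenizers handle special tokens that are absent from vocab.json/merges.txt is to pre-split the input on them before running BPE, so each special token stays atomic and maps to its own id. Here is a minimal, self-contained sketch of that pre-splitting step (illustration only; `split_on_special_tokens` is a hypothetical helper, not part of this repo or the Hugging Face API):

```python
import re

def split_on_special_tokens(text, special_tokens):
    """Split text so special tokens survive as atomic pieces; everything
    else would then fall through to normal BPE tokenization (sketch)."""
    # Match longest tokens first so a longer token is never shadowed
    # by a shorter one that happens to be its prefix.
    pattern = "|".join(
        re.escape(t) for t in sorted(special_tokens, key=len, reverse=True)
    )
    # The capture group makes re.split keep the matched special tokens.
    parts = re.split(f"({pattern})", text)
    return [p for p in parts if p]

pieces = split_on_special_tokens(
    "<|user|>Hello<|assistant|>Hi there<|endoftext|>",
    ["<|endoftext|>", "<|user|>", "<|assistant|>"],
)
print(pieces)
# → ['<|user|>', 'Hello', '<|assistant|>', 'Hi there', '<|endoftext|>']
```

A C++ port would do the same splitting pass first and assign the special tokens ids beyond the base vocabulary, mirroring how the Rust implementation registers added tokens on top of the merges-derived vocab.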