
Special tokens not showing up correctly when tokenized.

Open · amazingvince opened this issue 1 year ago · 1 comment

I tried adding some special tokens to the vocab of a pretrained model (and made a PR for a minor code fix). When I encode strings, these new tokens are sometimes broken into many tokens instead of being encoded as a single token.

How do I make sure my special tokens always map to the same ID? Code to reproduce what I am seeing:

import tokenmonster

vocab = tokenmonster.load("englishcode-32000-consistent-v1")

# Add the special tokens, then resize back down to 32,000 tokens
vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)
vocab.resize(32000, reset_token_ids=False)

# Tokenize some text
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]
for s in text:
    print(vocab.tokenize(s))
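As an aside, the fragmentation described above is what a greedy longest-match tokenizer produces whenever the full special-token string is absent from the vocabulary (for example, if a later resize() drops it). The toy tokenizer below is a hypothetical sketch for illustration only, not TokenMonster's actual algorithm or vocabulary; the vocab sets are invented.

```python
def greedy_tokenize(text, vocab):
    """Repeatedly match the longest vocabulary entry at the current position."""
    out = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest substring first
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])  # no match: fall back to a single character
            i += 1
    return out

# Hypothetical vocabularies: one contains the special token, one does not.
with_special = {"<|im_end|>", "<", "|", "im", "_", "end", ">"}
without_special = with_special - {"<|im_end|>"}

print(greedy_tokenize("<|im_end|>", with_special))     # ['<|im_end|>']
print(greedy_tokenize("<|im_end|>", without_special))  # ['<', '|', 'im', '_', 'end', '|', '>']
```

The second call shows the symptom from the issue: once the intact special token is missing, the same text falls apart into its constituent pieces, so the IDs are no longer stable.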

— amazingvince, Nov 05 '23

It's unclear what you're trying to do, what you expect to happen, and what is actually happening. Please post the results you get, along with a description of what you expected to get.

— alasdairforsythe, Nov 15 '23