llama
Why does one token correspond to multiple token ids?
It looks like 1217 and 3582 are sub-word tokens:
>>> tokenizer.encode('no', bos=False, eos=False)
[694]
>>> tokenizer.encode('thno', bos=False, eos=False)
[266, 1217]
>>> tokenizer.encode('yes', bos=False, eos=False)
[4874]
>>> tokenizer.encode('thyes', bos=False, eos=False)
[266, 3582]
([266] = "th")
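This is most likely SentencePiece behaviour: a piece that starts a word carries a leading "▁" marker, so "▁no" and the word-internal "no" are distinct vocabulary entries with different ids. Here is a toy sketch of that idea. The vocab below reuses the ids from the session above but is otherwise hypothetical, and the greedy longest-match loop is a simplification (the real llama tokenizer segments with BPE merges, not greedy matching):

```python
# Toy illustration, NOT the real llama vocab: SentencePiece-style
# tokenizers mark word-initial pieces with "▁", so the same surface
# string maps to different ids depending on its position in the word.
vocab = {
    "▁th": 266,    # "th" at the start of a word
    "▁no": 694,    # "no" at the start of a word
    "no": 1217,    # hypothetical id for "no" inside a word
    "yes": 3582,   # hypothetical id for "yes" inside a word
    "▁yes": 4874,  # "yes" at the start of a word
}

def encode(text: str) -> list[int]:
    """Greedy longest-match over the toy vocab.

    SentencePiece prepends "▁" to mark the word boundary before
    segmenting; greedy matching is a simplification that happens to
    reproduce the ids shown in the session above.
    """
    s = "▁" + text  # mark the word boundary
    ids = []
    i = 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest piece first
            if s[i:j] in vocab:
                ids.append(vocab[s[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no piece matches at {s[i:]!r}")
    return ids

print(encode("no"))     # [694]
print(encode("thno"))   # [266, 1217]  -> "▁th" + "no"
print(encode("yes"))    # [4874]
print(encode("thyes"))  # [266, 3582]  -> "▁th" + "yes"
```

So 694 is "no" as a standalone word, while 1217 is the same letters glued to a preceding piece; the token text differs even though the visible characters match.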