Why one token corresponds to multiple token ids

Open FinalFlowers opened this issue 1 year ago • 1 comment

FinalFlowers, Apr 09 '23 05:04

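It looks like 1217 and 3582 are the sub-word tokens for "no" and "yes", used when the string continues a word instead of starting one, while 694 and 4874 are the word-initial versions: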

>>> tokenizer.encode('no', bos=False, eos=False)
[694]
>>> tokenizer.encode('thno', bos=False, eos=False)
[266, 1217]
>>> tokenizer.encode('yes', bos=False, eos=False)
[4874]
>>> tokenizer.encode('thyes', bos=False, eos=False)
[266, 3582]

([266] = "th")
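To make the idea concrete, here is a toy sketch that reproduces the outputs above without loading the real tokenizer. The ids are the ones from this thread, but pairing them with these exact pieces (with "▁" marking the start of a word) is my assumption, and `toy_encode` is a hypothetical greedy longest-match helper, not the algorithm the actual tokenizer uses.

```python
# Toy vocabulary: the ids come from the outputs above; pairing them with
# these pieces ("▁" marks the start of a word) is an assumption.
TOY_VOCAB = {
    "▁th": 266,
    "▁no": 694,
    "no": 1217,
    "▁yes": 4874,
    "yes": 3582,
}


def toy_encode(text):
    """Greedy longest-match encoding over TOY_VOCAB (illustration only)."""
    # Mimic a tokenizer that marks word starts: prefix the text with "▁"
    # and turn spaces into "▁".
    s = "▁" + text.replace(" ", "▁")
    ids = []
    i = 0
    while i < len(s):
        # Try the longest remaining substring first.
        for j in range(len(s), i, -1):
            piece = s[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no piece in toy vocab matches {s[i:]!r}")
    return ids


print(toy_encode("no"))     # [694]
print(toy_encode("thno"))   # [266, 1217]
print(toy_encode("yes"))    # [4874]
print(toy_encode("thyes"))  # [266, 3582]
```

In this sketch the same surface string "no" gets id 694 when it starts a word (piece "▁no") and id 1217 when it follows other characters in the same word (piece "no"), which is why one string can map to more than one token id.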

mawilson1234, Apr 10 '23 18:04