
<|endoftext|> token isn't encoded correctly

Open ttumiel opened this issue 1 week ago • 0 comments

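import torch
from mingpt.bpe import BPETokenizer

tokenizer = BPETokenizer()

# Encoding the literal string yields seven ordinary BPE tokens, not the single id 50256.
print(tokenizer("<|endoftext|>")) # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])

# Decoding id 50256 does produce the special-token string.
print(tokenizer.decode(torch.tensor([50256]))) # '<|endoftext|>'

# So decoding and then re-encoding does not round-trip back to 50256.
print(tokenizer(tokenizer.decode(torch.tensor([50256])))) # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])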
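One possible workaround, sketched here under assumptions rather than as a fix to minGPT itself: split the input on the literal `<|endoftext|>` string, encode each ordinary piece with the regular encoder, and insert id 50256 between the pieces. The helper name `encode_with_eot` and the stand-in `toy_encode` are hypothetical and exist only for illustration; the id 50256 is taken from the decode call in the snippet above.

```python
from typing import Callable, List

EOT = "<|endoftext|>"
EOT_ID = 50256  # id that decodes to "<|endoftext|>" in the report above


def encode_with_eot(text: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Encode `text`, mapping each literal "<|endoftext|>" to EOT_ID.

    `encode` is any function that turns ordinary text into a list of token
    ids (hypothetical parameter; pass in your own tokenizer wrapper).
    """
    ids: List[int] = []
    pieces = text.split(EOT)
    for i, piece in enumerate(pieces):
        if piece:
            # Ordinary text goes through the regular encoder.
            ids.extend(encode(piece))
        if i < len(pieces) - 1:
            # One special-token id for every separator that was split out.
            ids.append(EOT_ID)
    return ids


def toy_encode(s: str) -> List[int]:
    """Stand-in encoder for demonstration only: one id per character."""
    return [ord(c) for c in s]


if __name__ == "__main__":
    print(encode_with_eot("a<|endoftext|>b", toy_encode))
```

With minGPT, `encode` could plausibly be `lambda s: tokenizer(s)[0].tolist()`, assuming the tokenizer returns a tensor of shape `(1, n)` as the printed output above suggests. Whether special tokens in user-supplied text should be honored at all is a design choice, since it lets input text inject an end-of-text marker.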

ttumiel, Jun 27 '24 11:06