
Should -1 marker (as special token) be counted in vocab_size?

Open mw66 opened this issue 2 years ago • 1 comment

https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/projects/adder/adder.py#L118

https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/projects/adder/adder.py#L89
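
For context, a rough sketch of the pattern the linked lines appear to use: -1 is not a vocabulary token at all, it is a loss mask consumed by cross_entropy's ignore_index, so it never needs a slot in vocab_size. The shapes and values below are illustrative, not copied from adder.py:

import torch
import torch.nn.functional as F

vocab_size = 10  # digits 0-9; -1 never appears as an input token
logits = torch.randn(1, 6, vocab_size)           # model outputs for one 6-position sequence
targets = torch.tensor([[-1, -1, -1, 4, 2, 7]])  # -1 marks positions excluded from the loss

# cross_entropy skips every position whose target equals ignore_index,
# so -1 is never embedded and never needs a row in the output head
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1), ignore_index=-1)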

mw66 commented Sep 19 '23 15:09

To my understanding, we don't add negative values to the tokenizer; we just extend the vocab, like this:

import tiktoken

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
# whitelist the 4 extra special-token strings when encoding
encode = lambda s: enc.encode(s, allowed_special={"<endOfText>", "<bot>", "<human>", "<system>"})
decode = lambda l: enc.decode(l)

This just adds 4 extra tokenizer tokens to the already ~50,000-token vocab. You probably could have a negative tokenizer value (a [-1] token), but you would have to customize tiktoken for that, and adding negative values to the tokenizer means you now have to account for a larger fixed-size integer set, which I think would make it slower.
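
If you do want those 4 strings to get their own ids on top of the stock GPT-2 vocab, here is a rough sketch of one way to do it, following the pattern tiktoken documents for building an extended Encoding (the name "gpt2_chat" and the id assignments are just illustrative):

import tiktoken

base = tiktoken.get_encoding("gpt2")  # stock GPT-2: 50257 tokens, ids 0..50256

# register the 4 extra special tokens right after the existing vocab
extra = ["<endOfText>", "<bot>", "<human>", "<system>"]
enc = tiktoken.Encoding(
    name="gpt2_chat",  # illustrative name
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={**base._special_tokens,
                    **{tok: base.n_vocab + i for i, tok in enumerate(extra)}},
)

encode = lambda s: enc.encode(s, allowed_special=set(extra))
decode = lambda l: enc.decode(l)
print(enc.n_vocab)  # 50257 + 4; every id stays a non-negative integer

Either way, all the ids stay non-negative, which is why a [-1] token never comes up.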

tl;dr: it's possible, but people don't really need negative tokens; it's just extra work and slower.

VatsaDev commented Sep 24 '23 16:09