minGPT
Should -1 marker (as special token) be counted in vocab_size?
https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/projects/adder/adder.py#L118
https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/projects/adder/adder.py#L89
To my understanding, we don't add negative values to the tokenizer; we just extend the vocab, like this:
import tiktoken

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
encode = lambda s: enc.encode(s, allowed_special={"<endOfText>", "<bot>", "<human>", "<system>"})
decode = lambda l: enc.decode(l)
This just adds 4 extra special tokens on top of the already ~50,000-token vocab. You probably could have a negative tokenizer value (a [-1] token), but you would have to customize tiktoken for that, and adding negative values to the tokenizer means you now have to account for a larger fixed-size integer range, which I think would make it slower.
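One detail worth noting: allowed_special only whitelists special tokens the encoding already knows about, so to actually register the 4 extra tokens you build an extended Encoding. A rough sketch (this assumes tiktoken's Encoding constructor and the _pat_str / _mergeable_ranks / _special_tokens attributes the gpt2 encoding currently exposes; the name "gpt2_ext" is made up):

import tiktoken

base = tiktoken.get_encoding("gpt2")  # ~50k tokens, <|endoftext|> = 50256

# append the 4 special tokens right after the existing vocab
enc = tiktoken.Encoding(
    name="gpt2_ext",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<endOfText>": base.n_vocab,
        "<bot>": base.n_vocab + 1,
        "<human>": base.n_vocab + 2,
        "<system>": base.n_vocab + 3,
    },
)

encode = lambda s: enc.encode(s, allowed_special={"<endOfText>", "<bot>", "<human>", "<system>"})
decode = lambda l: enc.decode(l)

Either way the extra ids stay non-negative; the model's vocab_size just grows by 4.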
tl;dr: it's possible, but people don't really need negative tokens; it's just extra work and slower.
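For the adder.py case specifically, if I'm reading the linked lines right, the -1 never needs to be a token at all: it only appears in the targets as a loss mask and is consumed by the loss's ignore_index, so it never indexes the embedding or the output layer. A toy paraphrase of that pattern:

import torch
import torch.nn.functional as F

# toy version of the adder setup: vocab_size = 10 (digits 0-9), and -1 marks
# target positions that should not contribute to the loss
logits = torch.randn(6, 10)                    # 6 positions, 10-way classification
targets = torch.tensor([-1, -1, -1, 2, 7, 1])  # -1 = "don't train on this position"
loss = F.cross_entropy(logits, targets, ignore_index=-1)
print(loss)  # only the last 3 positions contribute; -1 never touches the vocab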