missed double-tab merge opportunities in the tokenizer
I was playing with the tokenizer, and I noticed some missed merge opportunities.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer(['\t', '\t\t', '-\t\t', '\t\t-'])
{'input_ids': [[128000, 197], [128000, 298], [128000, 12, 298], [128000, 197, 197, 12]], 'attention_mask': [[1, 1], [1, 1], [1, 1, 1], [1, 1, 1, 1]]}
Observe:
- tab is 197
- tab tab is 298
- but tab tab is not merged when followed by "-" ('\t\t-' comes out as [197, 197, 12], while '-\t\t' does merge to [12, 298])
This is probably a consequence of how the regex pre-tokenizer splits the text, and thus in some sense not a bug... but it is somewhat unfortunate. The sequence \t\t} exhibits the same behavior, and it is very common in Go code, so there are lots of missed merges.
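For illustration, here is a minimal sketch of that split. The pattern below is the cl100k_base-style split regex that I believe Llama 3 uses; treat it as an assumption, not the exact production pattern. BPE merges only happen inside each pre-token, so once the two tabs land in separate pieces they can never be merged:

>>> import regex  # third-party `regex` module, needed for \p{L}/\p{N}
>>> # Assumed Llama 3 / cl100k_base-style split pattern (my assumption)
>>> PAT = (
...     r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
...     r"|[^\r\n\p{L}\p{N}]?\p{L}+"
...     r"|\p{N}{1,3}"
...     r"| ?[^\s\p{L}\p{N}]+[\r\n]*"
...     r"|\s*[\r\n]+"
...     r"|\s+(?!\S)"
...     r"|\s+"
... )
>>> for s in ["\t\t", "-\t\t", "\t\t-"]:
...     print(repr(s), regex.findall(PAT, s))
'\t\t' ['\t\t']
'-\t\t' ['-', '\t\t']
'\t\t-' ['\t', '\t', '-']

In the '\t\t-' case the \s+(?!\S) alternative peels off only the first tab (the full run would be followed by "-", a non-space), the plain \s+ alternative takes the second tab, and the merge to 298 can never happen.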
(This also reproduces when using tiktoken.)
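As a rough check, tiktoken's cl100k_base encoding, which as far as I can tell uses the same family of split pattern, shows the same splitting behavior (the token IDs are from a different vocabulary, of course):

>>> import tiktoken
>>> enc = tiktoken.get_encoding("cl100k_base")
>>> for s in ["\t", "\t\t", "-\t\t", "\t\t-"]:
...     print(repr(s), len(enc.encode(s)))  # expectation: '\t\t-' needs one more token than '-\t\t'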