
missed double-tab merge opportunities in the tokenizer

Open josharian opened this issue 1 year ago • 1 comments

I was playing with the tokenizer, and I noticed some missed merge opportunities.

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer(['\t', '\t\t', '-\t\t', '\t\t-'])
{'input_ids': [[128000, 197], [128000, 298], [128000, 12, 298], [128000, 197, 197, 12]], 'attention_mask': [[1, 1], [1, 1], [1, 1, 1], [1, 1, 1, 1]]}

Observe:

  • tab is 197
  • tab tab is 298
  • but tab tab is not merged when followed by "-"

This is probably a consequence of how the pretokenization regex splits the input, and thus in some sense not a bug... but it is somewhat unfortunate. The sequence \t\t} exhibits the same behavior, and is very common in Go code, so there are lots of missed merges.
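The split can be reproduced with a simplified fragment of the GPT-4-style pretokenizer pattern. This is a sketch, not the actual Llama 3 regex: the real pattern also handles contractions, letters, digits, and newlines, and the `pretokenize` helper name is mine. The point is the `\s+(?!\S)` alternative, which peels the last whitespace character off any run that precedes a non-space character, so the two tabs in "\t\t-" land in separate pretokens and BPE can never merge them.

```python
import re

# Simplified fragment of a GPT-4-style pretokenizer regex (assumption:
# only the pieces relevant to tabs and punctuation are kept).
PRETOKEN_RE = re.compile(r" ?[^\s]+|\s+(?!\S)|\s+")

def pretokenize(s: str) -> list[str]:
    """Split s into pretokens; BPE merges only happen within one pretoken."""
    return PRETOKEN_RE.findall(s)

# Trailing tabs stay in one pretoken, so "\t\t" can become token 298:
print(pretokenize("-\t\t"))   # ['-', '\t\t']

# But the (?!\S) lookahead detaches the final tab of a run that is
# followed by non-whitespace, so the tabs can never be merged:
print(pretokenize("\t\t-"))   # ['\t', '\t', '-']
```

This matches the tokenizer output above: "-\t\t" encodes the tabs as 298, while "\t\t-" leaves them as 197, 197.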

josharian avatar Jun 01 '24 01:06 josharian

(This also reproduces using tiktoken.)

josharian avatar Jun 03 '24 23:06 josharian