
Custom tokenizer fails to encode despite characters being in mergeable_ranks

Open afang-story opened this issue 1 year ago • 2 comments

Hello,

I'm trying to create a custom tokenizer, but I get "pyo3_runtime.PanicException: no entry found for key" even though every character in the input has an entry in mergeable_ranks. This seems to happen when a character that requires multiple bytes is immediately followed by another character.

Here is a simple example for reproducibility:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'“'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("a“"))  # this works: [1, 0]
print(enc.encode("“a"))  # raises pyo3_runtime.PanicException: no entry found for key

Any ideas for how to fix this?

Thanks in advance for the help

afang-story avatar May 02 '24 10:05 afang-story

It also happens with non-Latin characters in the reverse order (multi-byte character second), e.g.:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'か'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("aか"))  # raises the same pyo3_runtime.PanicException

Maybe there's a setting that needs to be changed, or some fallback that needs to be added, to cover this?

Muennighoff avatar May 02 '24 18:05 Muennighoff

I'm having the same issue. Have you solved it?

djsaber avatar Jul 28 '24 09:07 djsaber

>>> '“'.encode()
b'\xe2\x80\x9c'
>>> len('“'.encode())
3

You'll need to have individual bytes in your vocabulary.

On top of that, tiktoken assumes that token index corresponds to merge priority (i.e. the sequence of merges that produces a token must yield intermediate tokens with increasing rank). https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/src/lib.rs#L25
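
Putting both points together, here is a minimal sketch of a vocabulary that should satisfy them (my own illustration, not from the thread; the encoding name tik_test_fixed is made up). It ranks every individual byte first, then adds the intermediate merge b'\xe2\x80' at a lower rank than the full three-byte token so merge order and rank order agree:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Rank every single byte first (ranks 0..255) so the BPE loop always
# finds an entry for any byte it encounters.
mergeable_ranks = {bytes([i]): i for i in range(256)}

# '“'.encode() is b'\xe2\x80\x9c'. Add the intermediate merge at a lower
# rank than the full token so intermediate ranks increase.
mergeable_ranks[b'\xe2\x80'] = 256        # \xe2 + \x80
mergeable_ranks[b'\xe2\x80\x9c'] = 257    # \xe2\x80 + \x9c

enc = tiktoken.Encoding(
    name="tik_test_fixed",  # hypothetical name
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=mergeable_ranks,
    special_tokens={},
)
print(enc.encode("“a"))  # should now encode, e.g. [257, 97]
print(enc.encode("a“"))  # and the reverse order too, e.g. [97, 257]

The ordering matters because of the lib.rs constraint above: the intermediate token b'\xe2\x80' (rank 256) must sit below b'\xe2\x80\x9c' (rank 257), otherwise the merge that builds the full character could never be applied.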

hauntsaninja avatar Oct 03 '24 22:10 hauntsaninja