tokenizers

Adding tokens to a tokenizer with subword support?

Open noamgat opened this issue 4 months ago • 1 comment

Hi, when I add an out-of-vocabulary character to a tokenizer, I only get the new token ID when the character is encoded as a whole word, not as a subword. Is there a parameter I need to pass for it to also match inside subwords?

Example:

from transformers import AutoTokenizer
from tokenizers import AddedToken

model_id = 'TheBloke/Llama-2-7b-Chat-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Out-of-vocab character to add
new_char = '筹'
tokenizer.add_tokens(AddedToken(new_char, single_word=False, lstrip=True))

print(tokenizer.encode(new_char))              # character on its own
print(tokenizer.encode(new_char + new_char))   # character repeated back-to-back
print(tokenizer.encode('"' + new_char))        # character as a subword after a quote

The output is:

[1, 32000]
[1, 32000, 234, 176, 188]
[1, 376, 234, 176, 188]
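
For reference, the trailing IDs in the second and third lines appear to be Llama's byte-fallback tokens for the UTF-8 bytes of the character, rather than the added token. A quick way to check (assuming the same tokenizer object as above; the expected byte tokens are an assumption based on Llama's byte-fallback vocabulary):

# Inspect which tokens those IDs map to
print(tokenizer.convert_ids_to_tokens([234, 176, 188]))
# expected something like ['<0xE7>', '<0xAD>', '<0xB9>'], i.e. the UTF-8 bytes of the new character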

What do I need to change in my add_tokens call so that the desired token 32000 appears twice in the second example and once in the third?
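
One thing that might be worth trying, though I'm not sure it is the intended fix: AddedToken also accepts a normalized flag, and if the fast tokenizer normalizes the added token's content before matching (for example by prepending the SentencePiece '▁'), the token would only match at the start of a word. Disabling normalization is a guess, not a confirmed solution:

# Guess: add the token with normalization disabled so it can match mid-word
tokenizer.add_tokens(AddedToken(new_char, single_word=False, lstrip=True, normalized=False))
print(tokenizer.encode(new_char + new_char))   # hoping for [1, 32000, 32000]
print(tokenizer.encode('"' + new_char))        # hoping for [1, 376, 32000]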

noamgat · Sep 27 '24 18:09