tokenizers
Adding tokens to a tokenizer with subword support?
Hi, when I add an out-of-vocabulary character to a tokenizer, I only get the new token ID when the character is encoded as a whole word, not when it appears as a subword inside a longer string. Is there a parameter I need to add to the call for it to also be matched as a subword?
Example:
from transformers import AutoTokenizer
from tokenizers import AddedToken
model_id = 'TheBloke/Llama-2-7b-Chat-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)
new_char = '筹'  # an out-of-vocab character (3 bytes in UTF-8)
# single_word=False should allow matching inside longer words; lstrip=True strips whitespace on the left
tokenizer.add_tokens(AddedToken(new_char, single_word=False, lstrip=True))
print(tokenizer.encode(new_char))
print(tokenizer.encode(new_char + new_char))
print(tokenizer.encode('"' + new_char))
And the output is:
[1, 32000]
[1, 32000, 234, 176, 188]
[1, 376, 234, 176, 188]
What do I need to modify in my add_tokens call so that I get the desired token 32000 twice in the second example and once in the third?
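For reference, here is a minimal variant I would try next, assuming the normalized flag of AddedToken (which controls whether the added token's content goes through the tokenizer's normalizer before matching) is what blocks the mid-word match; I am not sure that is actually the right knob:

from transformers import AutoTokenizer
from tokenizers import AddedToken

model_id = 'TheBloke/Llama-2-7b-Chat-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

new_char = '筹'
# normalized=False: match the raw token content, skipping the tokenizer's normalizer
# single_word=False: allow the token to match inside longer words as well
tokenizer.add_tokens(AddedToken(new_char, single_word=False, lstrip=True, normalized=False))

print(tokenizer.encode(new_char + new_char))  # hoping for [1, 32000, 32000]
print(tokenizer.encode('"' + new_char))       # hoping for [1, 376, 32000]

If normalized is not it, my other guess would be the SentencePiece prefix-space handling, but I have not been able to confirm that.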