How do I replace spare tokens?
System Info
I want to SFT Mistral-v0.3 with my own chat template, so I followed this comment and replaced some [control_n] tokens with special tokens for the chat template. However, the new tokens were actually appended and the vocabulary size increased. Is there any way to replace tokens in the vocabulary instead?
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
tokenizer.json
{
"version": "1.0",
"truncation": null,
"padding": null,
"added_tokens": [
...
{
"id": 10,
"content": "<|system|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 11,
"content": "<|user|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 12,
"content": "<|assistant|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 13,
"content": "<|eot|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
...
tokenizer_config.json
{
"add_bos_token": true,
"add_eos_token": false,
"add_prefix_space": true,
"added_tokens_decoder": {
...
"10": {
"content": "<|system|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"11": {
"content": "<|user|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"12": {
"content": "<|assistant|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"13": {
"content": "<|eot|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
...
}
test code
from pprint import pprint
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir)  # model_dir points to the modified tokenizer files
pprint(tokenizer.added_tokens_decoder)
output
...
768: AddedToken("[control_766]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
769: AddedToken("[control_767]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
770: AddedToken("[control_768]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32768: AddedToken("<|system|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32769: AddedToken("<|user|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32770: AddedToken("<|assistant|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32771: AddedToken("<|eot|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)}
Expected behavior
[control_n] tokens can be replaced with any custom token without increasing the vocabulary size.
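For reference, going through `add_special_tokens` shows the same appending behavior: the chat-template tokens get fresh ids above the original vocabulary instead of reusing the `[control_n]` slots. A minimal sketch of that (token strings taken from the template above, expected numbers apply to this checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
print(len(tokenizer))  # 32768 for the v0.3 checkpoint

# Adding the chat-template tokens as special tokens appends new entries
# instead of reusing the pre-allocated [control_n] slots.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|system|>", "<|user|>", "<|assistant|>", "<|eot|>"]}
)
print(len(tokenizer))                                  # 32772 -- the vocabulary grew
print(tokenizer.convert_tokens_to_ids("<|system|>"))   # 32768, not 10
```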
cc @itazap that would indeed be a good addition! More and more people pre-allocate some tokens, and we don't have a way to replace a token.
PS: you can already replace directly in the `vocab` and the `added_vocab` (since these tokens are part of both)
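In practice that means editing the serialized files: the token string appears in `tokenizer.json` under both "added_tokens" and the model's "vocab", and once more in `tokenizer_config.json` under "added_tokens_decoder". A rough sketch of scripting the swap, assuming a locally saved copy of the tokenizer with a BPE-style `model.vocab` map; the folder path and the old/new token strings are placeholders:

```python
import json

path = "path/to/saved_tokenizer"          # folder from tokenizer.save_pretrained(...)
old, new = "[control_8]", "<|system|>"    # placeholder token names

# tokenizer.json: the token appears both in "added_tokens" and in the BPE "vocab".
# Special/control tokens normally don't appear in "merges", so no merge edits here.
with open(f"{path}/tokenizer.json") as f:
    tok = json.load(f)
for entry in tok["added_tokens"]:
    if entry["content"] == old:
        entry["content"] = new
tok["model"]["vocab"][new] = tok["model"]["vocab"].pop(old)
with open(f"{path}/tokenizer.json", "w") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)

# tokenizer_config.json: the same token appears once more in "added_tokens_decoder".
with open(f"{path}/tokenizer_config.json") as f:
    cfg = json.load(f)
for entry in cfg["added_tokens_decoder"].values():
    if entry["content"] == old:
        entry["content"] = new
with open(f"{path}/tokenizer_config.json", "w") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```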
Hey @ArthurZucker,
I tried replacing a token in the `vocab` (not the `added_tokens`) in the tokenizer.json file. But when I try to load the tokenizer back up with `new_tokenizer = AutoTokenizer.from_pretrained('path/to/tokenizer')` I get the following error: "Exception: data did not match any variant of untagged enum ModelWrapper at line 356367 column 3"
Do you know what the problem might be?
You can ignore this, sorry. I found the issue: if you change the vocab in any way, you need to make sure you also update the merges accordingly.
https://github.com/huggingface/tokenizers/pull/1570 should help
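Until that lands, it helps to check up front whether the piece you want to rename is referenced by any BPE merge rule, since that is what has to stay consistent with the vocab. A hedged sketch of such a check; it assumes merges are stored either as "left right" strings or as [left, right] pairs, depending on the tokenizers version:

```python
import json

def check_merges(tokenizer_json_path: str, old: str) -> None:
    """Warn if a piece you plan to rename in "vocab" is referenced by a BPE merge rule.

    If it is, those merge rules must be updated as well, otherwise the
    tokenizer can fail to load (the error seen above).
    """
    with open(tokenizer_json_path) as f:
        model = json.load(f)["model"]

    hits = []
    for merge in model.get("merges", []):
        # "left right" strings in older files, [left, right] pairs in newer ones.
        pieces = merge.split(" ", 1) if isinstance(merge, str) else merge
        if old in pieces:
            hits.append(merge)

    if hits:
        print(f"{old!r} appears in {len(hits)} merge rule(s); update them as well.")
    else:
        print(f"{old!r} is not referenced by any merge rule; safe to rename in 'vocab'.")
```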
> PS: you can already replace directly in the `vocab` and the `added_vocab` (since these tokens are part of both)
Hi @ArthurZucker, do you mind elaborating on this? I'm experiencing the same issue as OP after modifying tokenizer.json and tokenizer_config.json. After loading the local tokenizer, I am unable to reassign or delete any entries in tokenizer.vocab manually. For example, del tokenizer.vocab['<token-to-replace>'] does not have any effect. I'm also unsure how to modify added_vocab.
I would love to see this feature go through!
Hi @itshuey, can you please share the model and token you are attempting this with (in a short snippet would be great!) so I can take a look? 😊
Sure @itazap, thank you. I am using the Mistral-7B-v0.3 tokenizer.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.3')
>>> tokenizer.vocab['[control_8]']
10
>>> del tokenizer.vocab['[control_8]']
>>> tokenizer.vocab['[control_8]']
10
Reassignment is also futile.
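(That is expected behaviour for the fast tokenizers rather than something specific to this checkpoint: the `vocab` attribute is rebuilt from the Rust backend on every access, so mutating the returned dict never reaches the tokenizer. A quick way to see this with the tokenizer loaded above:)

```python
# Each access to .vocab calls get_vocab(), which builds a fresh Python dict,
# so deletions and reassignments only affect that throwaway copy.
v1 = tokenizer.vocab
v2 = tokenizer.vocab
print(v1 is v2)  # False: two independent copies
```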
Hi @itshuey, indeed this isn't fully supported with AutoTokenizer because it reads the tokenizer.model file, which can't be modified manually. However, you should be able to remove/update tokens if you use
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
instead of AutoTokenizer. Let me know if this works for your use case! 😊
I'm getting this error when I try to instantiate the PreTrainedTokenizerFast object
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LlamaTokenizer'.
The class this function is called from is 'PreTrainedTokenizerFast'.
Regardless, I get the same result from trying to modify tokenizer.vocab:
>>> tokenizer.vocab['[control_8]']
10
>>> del tokenizer.vocab['[control_8]']
>>> tokenizer.vocab['[control_8]']
10
Taking the tip into account, I tried using LlamaTokenizer to load my tokenizer. Since get_vocab relies on sentencepiece, I tried to use tokenizer.sp_model.set_vocabulary(), but I couldn't figure out what the valid_vocab list parameter was meant to be. I am hoping there's a transformers-based solution to replace unused control tokens associated with specific ids (in my case 10) without increasing the vocabulary size.
Sorry, I apologize, I wasn't very clear. You are correct that .vocab cannot be modified programmatically, and using del or pop would not work. Updating or deleting tokens would be a new feature and would be supported by the PR Arthur linked. The text you are getting is a warning, not an error; it should not error out or fail.
For now, you can do this by:
- Loading the model with `PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')`.
- Saving the model and locating the model folder with the `tokenizer.json` and `tokenizer_config.json` files.
- Manually editing the `tokenizer.json` file (note: there are 2 changes needed in this JSON: in "added_tokens" and in "vocab") and the `tokenizer_config.json` file (1 change in "added_tokens_decoder").
- Loading the model from the local folder you modified.
from transformers import PreTrainedTokenizerFast

temp_folder = "path/to/temp_folder"  # any local folder to save into

tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
tokenizer.vocab['[control_8]']  # 10
tokenizer.save_pretrained(temp_folder)

# Pause here and edit both tokenizer.json and tokenizer_config.json. Example below changed control_8 to control_NEW

tokenizer_reloaded = PreTrainedTokenizerFast.from_pretrained(temp_folder)
tokenizer_reloaded.vocab['[control_NEW]']  # 10
tokenizer_reloaded.added_tokens_decoder[10]  # AddedToken("[control_NEW]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)
tokenizer_reloaded.vocab['[control_8]']  # raises KeyError
Please let me know if you are able to reproduce!
Thank you for the clarification @itazap, modifying the configs worked perfectly! After using save_pretrained with my PreTrainedTokenizerFast tokenizer, I was able to load it locally (with the properly overwritten tokens) via AutoTokenizer as well. Really appreciate your help with this!
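(For anyone following along, the round trip can be verified with something like the sketch below; `temp_folder` is the locally edited copy from the steps above and `<|system|>` stands in for whichever token was swapped in:)

```python
from transformers import AutoTokenizer

temp_folder = "path/to/temp_folder"  # the locally edited copy from the steps above

tokenizer = AutoTokenizer.from_pretrained(temp_folder)
print(tokenizer.convert_tokens_to_ids("<|system|>"))  # 10 -- reuses the old [control_8] slot
print(len(tokenizer))                                 # unchanged vocabulary size (32768 here)
```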
Awesome, I'm glad it worked! Thanks for your patience 🤗