How do I replace spare tokens?
System Info
I want to SFT Mistral-v0.3 with my own chat template, so I followed this comment and replaced some [control_n] tokens with special tokens for the chat template. However, the new tokens were actually appended and the vocabulary size increased. Is there any way to replace tokens in the vocabulary instead?
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
tokenizer.json
{
"version": "1.0",
"truncation": null,
"padding": null,
"added_tokens": [
...
{
"id": 10,
"content": "<|system|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 11,
"content": "<|user|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 12,
"content": "<|assistant|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 13,
"content": "<|eot|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
...
tokenizer_config.json
{
"add_bos_token": true,
"add_eos_token": false,
"add_prefix_space": true,
"added_tokens_decoder": {
...
"10": {
"content": "<|system|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"11": {
"content": "<|user|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"12": {
"content": "<|assistant|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"13": {
"content": "<|eot|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
...
}
test code
from pprint import pprint
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir)  # model_dir points to the modified tokenizer files
pprint(tokenizer.added_tokens_decoder)
output
...
768: AddedToken("[control_766]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
769: AddedToken("[control_767]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
770: AddedToken("[control_768]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32768: AddedToken("<|system|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32769: AddedToken("<|user|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32770: AddedToken("<|assistant|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32771: AddedToken("<|eot|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)}
Expected behavior
[control_n] tokens can be replaced with any custom token without increasing the vocabulary size.
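For reference, going through `add_special_tokens` shows the same appending behavior: the chat-template tokens get fresh ids above the original vocabulary instead of reusing the `[control_n]` slots. A minimal sketch of that (token strings taken from the template above, expected numbers apply to this checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
print(len(tokenizer))  # 32768 for the v0.3 checkpoint

# Adding the chat-template tokens as special tokens appends new entries
# instead of reusing the pre-allocated [control_n] slots.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|system|>", "<|user|>", "<|assistant|>", "<|eot|>"]}
)
print(len(tokenizer))                                  # 32772 -- the vocabulary grew
print(tokenizer.convert_tokens_to_ids("<|system|>"))   # 32768, not 10
```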
cc @itazap that would indeed be a good addition! More and more people pre-allocate some tokens, and we don't have a way to replace a token.
PS: you can already replace directly in the `vocab` and the `added_vocab` (since these tokens are part of both)
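In practice that means editing the serialized files: the token string appears in `tokenizer.json` under both "added_tokens" and the model's "vocab", and once more in `tokenizer_config.json` under "added_tokens_decoder". A rough sketch of scripting the swap, assuming a locally saved copy of the tokenizer with a BPE-style `model.vocab` map; the folder path and the old/new token strings are placeholders:

```python
import json

path = "path/to/saved_tokenizer"          # folder from tokenizer.save_pretrained(...)
old, new = "[control_8]", "<|system|>"    # placeholder token names

# tokenizer.json: the token appears both in "added_tokens" and in the BPE "vocab".
# Special/control tokens normally don't appear in "merges", so no merge edits here.
with open(f"{path}/tokenizer.json") as f:
    tok = json.load(f)
for entry in tok["added_tokens"]:
    if entry["content"] == old:
        entry["content"] = new
tok["model"]["vocab"][new] = tok["model"]["vocab"].pop(old)
with open(f"{path}/tokenizer.json", "w") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)

# tokenizer_config.json: the same token appears once more in "added_tokens_decoder".
with open(f"{path}/tokenizer_config.json") as f:
    cfg = json.load(f)
for entry in cfg["added_tokens_decoder"].values():
    if entry["content"] == old:
        entry["content"] = new
with open(f"{path}/tokenizer_config.json", "w") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```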
Hey @ArthurZucker,
I tried replacing a token in the `vocab` (not the `added_tokens`) in the tokenizer.json file. But when I try to load the tokenizer back up with `new_tokenizer = AutoTokenizer.from_pretrained('path/to/tokenizer')` I get the following error: "Exception: data did not match any variant of untagged enum ModelWrapper at line 356367 column 3"
Do you know what the problem might be?
You can ignore this, sorry. I found the issue: if you change the vocab in any way, you need to make sure you also update the merges accordingly.
https://github.com/huggingface/tokenizers/pull/1570 should help
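Until that lands, it helps to check up front whether the piece you want to rename is referenced by any BPE merge rule, since that is what has to stay consistent with the vocab. A hedged sketch of such a check; it assumes merges are stored either as "left right" strings or as [left, right] pairs, depending on the tokenizers version:

```python
import json

def check_merges(tokenizer_json_path: str, old: str) -> None:
    """Warn if a piece you plan to rename in "vocab" is referenced by a BPE merge rule.

    If it is, those merge rules must be updated as well, otherwise the
    tokenizer can fail to load (the error seen above).
    """
    with open(tokenizer_json_path) as f:
        model = json.load(f)["model"]

    hits = []
    for merge in model.get("merges", []):
        # "left right" strings in older files, [left, right] pairs in newer ones.
        pieces = merge.split(" ", 1) if isinstance(merge, str) else merge
        if old in pieces:
            hits.append(merge)

    if hits:
        print(f"{old!r} appears in {len(hits)} merge rule(s); update them as well.")
    else:
        print(f"{old!r} is not referenced by any merge rule; safe to rename in 'vocab'.")
```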
> PS: you can already replace directly in the `vocab` and the `added_vocab` (since these tokens are part of both)
Hi @ArthurZucker, do you mind elaborating on this? I'm experiencing the same issue as OP after modifying tokenizer.json and tokenizer_config.json. After loading the local tokenizer, I am unable to reassign or delete any entries in tokenizer.vocab manually. For example, del tokenizer.vocab['<token-to-replace>'] does not have any effect. I'm also unsure how to modify added_vocab.
I would love to see this feature go through!
Hi @itshuey, can you please share the model and token you are attempting this with (in a short snippet would be great!) so I can take a look? 😊
Sure @itazap, thank you. I am using the Mistral-7B-v0.3 tokenizer.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.3')
>>> tokenizer.vocab['[control_8]']
10
>>> del tokenizer.vocab['[control_8]']
>>> tokenizer.vocab['[control_8]']
10
Reassignment is also futile.
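(That is expected behaviour for the fast tokenizers rather than something specific to this checkpoint: the `vocab` attribute is rebuilt from the Rust backend on every access, so mutating the returned dict never reaches the tokenizer. A quick way to see this with the tokenizer loaded above:)

```python
# Each access to .vocab calls get_vocab(), which builds a fresh Python dict,
# so deletions and reassignments only affect that throwaway copy.
v1 = tokenizer.vocab
v2 = tokenizer.vocab
print(v1 is v2)  # False: two independent copies
```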
Hi @itshuey, indeed this isn't fully supported with AutoTokenizer because it reads the tokenizer.model file, which can't be modified manually. However, you should be able to remove/update tokens if you use
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
instead of AutoTokenizer. Let me know if this works for your use case! 😊
I'm getting this error when I try to instantiate the PreTrainedTokenizerFast object
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LlamaTokenizer'.
The class this function is called from is 'PreTrainedTokenizerFast'.
Regardless, I get the same result from trying to modify tokenizer.vocab:
>>> tokenizer.vocab['[control_8]']
10
>>> del tokenizer.vocab['[control_8]']
>>> tokenizer.vocab['[control_8]']
10
Taking the tip into account, I tried using LlamaTokenizer to load my tokenizer. Since get_vocab relies on sentencepiece, I tried to use tokenizer.sp_model.set_vocabulary(), but I couldn't figure out what the valid_vocab list parameter was meant to be. I am hoping there's a transformers-based solution to replace unused control tokens associated with specific ids (in my case 10) without increasing the vocabulary size.
Sorry, I apologize, I wasn't very clear. You are correct that .vocab cannot be modified programmatically, and using del or pop would not work. Updating or deleting tokens would be a new feature and would be supported by the PR Arthur linked. The text you are getting is a warning, not an error; it should not error out or fail.
For now, you can do this by:
- Loading the model with `PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')`.
- Saving the model and locating the model folder with the `tokenizer.json` and `tokenizer_config.json` files.
- Manually editing the `tokenizer.json` file (note: there are 2 changes needed in this JSON: in "added_tokens" and in "vocab") and the `tokenizer_config.json` file (1 change in "added_tokens_decoder").
- Loading the model from the local folder you modified.
from transformers import PreTrainedTokenizerFast

temp_folder = "path/to/temp_folder"  # any local folder to save into

tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
tokenizer.vocab['[control_8]']  # 10
tokenizer.save_pretrained(temp_folder)

# Pause here and edit both tokenizer.json and tokenizer_config.json. Example below changed control_8 to control_NEW

tokenizer_reloaded = PreTrainedTokenizerFast.from_pretrained(temp_folder)
tokenizer_reloaded.vocab['[control_NEW]']  # 10
tokenizer_reloaded.added_tokens_decoder[10]  # AddedToken("[control_NEW]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)
tokenizer_reloaded.vocab['[control_8]']  # raises KeyError
Please let me know if you are able to reproduce!
Thank you for the clarification @itazap, modifying the configs worked perfectly! After using save_pretrained with my PreTrainedTokenizerFast tokenizer, I was able to load it locally (with the properly overwritten tokens) via AutoTokenizer as well. Really appreciate your help with this!
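(For anyone following along, the round trip can be verified with something like the sketch below; `temp_folder` is the locally edited copy from the steps above and `<|system|>` stands in for whichever token was swapped in:)

```python
from transformers import AutoTokenizer

temp_folder = "path/to/temp_folder"  # the locally edited copy from the steps above

tokenizer = AutoTokenizer.from_pretrained(temp_folder)
print(tokenizer.convert_tokens_to_ids("<|system|>"))  # 10 -- reuses the old [control_8] slot
print(len(tokenizer))                                 # unchanged vocabulary size (32768 here)
```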
Awesome, I'm glad it worked! Thanks for your patience 🤗