Unused tokens in gemma tokenizer
I am using "google/gemma-2b-it" model from HuggingFace. I realized there are 99 unused tokens (<unused0> ,<unused1>,<unused2>...) in first 106 token ids. Does anyone know their purpose? Just wondering.
@hboyar, if you iterate through the vocab, you should find some <unusedXX> tokens. They weren't used during training, but they can be repurposed for anything else.
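A minimal sketch of that vocab scan, assuming the transformers library is installed and you have access to the gated checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# Collect every reserved token of the form <unusedXX> from the vocab.
unused = {tok: tok_id for tok, tok_id in tokenizer.get_vocab().items()
          if tok.startswith("<unused")}

print(len(unused))                                 # 99
print(min(unused.values()), max(unused.values()))  # 7 105
```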
The 99 unused tokens are reserved in the pretrained tokenizer model to assist with more efficient training/fine-tuning. They have the string format <unused[0-98]> and cover the token id range [7-105]. These reserved tokens let model developers train/fine-tune Gemma more efficiently by reusing them rather than attempting to resize the BPE-based tokenizer. Let us know if this helps. Thank you!
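To make that concrete, here is a minimal sketch; the choice of <unused0> as a control marker and the insertion point are assumptions for illustration, not an official recipe:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# "<unused0>" already has a row in the embedding matrix (token id 7),
# so no resize is needed; the fine-tuning data just has to teach the
# model what the marker means.
marker_id = tokenizer.convert_tokens_to_ids("<unused0>")

ids = tokenizer("translate this sentence")["input_ids"]
ids.insert(1, marker_id)  # hypothetical usage: place the marker after <bos>
print(ids[:3])
```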
That's a really cool feature. It would definitely help when creating custom fine-tuning tasks. I'll keep this in mind. Thank you!
Hey guys,
I was wondering if there is a way to reassign those unused tokens to a new token I want to introduce to the model, something like replacing an <unusedXX> entry with my own token.
Thanks.
Hi @kishoreKunisetty,
We can add new tokens to the vocabulary, but direct modification of the tokenizer's model is unsupported and can lead to errors because of how the high-level APIs are implemented. Please refer to this gist. If you want to replace unused tokens with custom tokens without expanding the vocabulary, you'll need to manually adjust the tokenizer's config files, including vocab.json and tokenizer.json.
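A minimal sketch of that manual edit (the local path and the replacement name <my_token> are assumptions; it also assumes the fast tokenizer's tokenizer.json stores the vocab as a token-to-id map, as the BPE model type does):

```python
import json

# Assumes the tokenizer was saved locally first, e.g.
#   AutoTokenizer.from_pretrained("google/gemma-2b-it").save_pretrained("gemma-tok")
path = "gemma-tok/tokenizer.json"

with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Rename the reserved token in place so it keeps its original id (7).
vocab = data["model"]["vocab"]
vocab["<my_token>"] = vocab.pop("<unused0>")

# If the token is also listed in the added_tokens table, rename it there too.
for entry in data.get("added_tokens", []):
    if entry["content"] == "<unused0>":
        entry["content"] = "<my_token>"

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
```

After this, loading the tokenizer from "gemma-tok" should map <my_token> to id 7 without any change to the model's embedding size.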
Thank you.
Could you please confirm whether this issue is resolved for you with the above comment? Please feel free to close the issue if it is resolved.
Thank you.
Resolved.