Unused tokens in gemma tokenizer
I am using "google/gemma-2b-it" model from HuggingFace. I realized there are 99 unused tokens (<unused0> ,<unused1>,<unused2>...) in first 106 token ids. Does anyone know their purpose? Just wondering.
@hboyar, if you iterate through the vocab, you should find some <unusedXX> tokens. They weren't used during training, but they can be repurposed for anything else.
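A minimal sketch of that vocab scan, assuming the transformers library is installed and you have access to the gated checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# Collect every reserved token of the form <unusedXX> from the vocab.
unused = {tok: tok_id for tok, tok_id in tokenizer.get_vocab().items()
          if tok.startswith("<unused")}

print(len(unused))                                 # 99
print(min(unused.values()), max(unused.values()))  # 7 105
```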
The 99 unused tokens are reserved in the pretrained tokenizer model to assist with more efficient training/fine-tuning. They have the string format <unused[0-98]> and cover the token id range [7-105]. These reserved tokens let model developers train/fine-tune Gemma more efficiently by reusing them rather than attempting to resize the BPE-based tokenizer. Let us know if this helps. Thank you!
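To make that concrete, here is a minimal sketch; the choice of <unused0> as a control marker and the insertion point are assumptions for illustration, not an official recipe:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# "<unused0>" already has a row in the embedding matrix (token id 7),
# so no resize is needed; the fine-tuning data just has to teach the
# model what the marker means.
marker_id = tokenizer.convert_tokens_to_ids("<unused0>")

ids = tokenizer("translate this sentence")["input_ids"]
ids.insert(1, marker_id)  # hypothetical usage: place the marker after <bos>
print(ids[:3])
```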
That's a really cool feature. It would definitely help when creating custom fine-tuning tasks. I'll keep this in mind. Thank you!
Hey guys,
I was wondering if there is a way to reassign those unused tokens to a new token I want to introduce to the model, something like replacing an <unusedXX> entry with my own token.
Thanks.
Hi @kishoreKunisetty,
We can add new tokens to the vocabulary, but direct modification of the tokenizer's model is unsupported and can lead to errors because of how the high-level APIs are implemented. Please refer to this gist. If you want to replace unused tokens with custom tokens without expanding the vocabulary, you'll need to manually adjust the tokenizer's config files, including vocab.json and tokenizer.json.
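A minimal sketch of that manual edit (the local path and the replacement name <my_token> are assumptions; it also assumes the fast tokenizer's tokenizer.json stores the vocab as a token-to-id map, as the BPE model type does):

```python
import json

# Assumes the tokenizer was saved locally first, e.g.
#   AutoTokenizer.from_pretrained("google/gemma-2b-it").save_pretrained("gemma-tok")
path = "gemma-tok/tokenizer.json"

with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Rename the reserved token in place so it keeps its original id (7).
vocab = data["model"]["vocab"]
vocab["<my_token>"] = vocab.pop("<unused0>")

# If the token is also listed in the added_tokens table, rename it there too.
for entry in data.get("added_tokens", []):
    if entry["content"] == "<unused0>":
        entry["content"] = "<my_token>"

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
```

After this, loading the tokenizer from "gemma-tok" should map <my_token> to id 7 without any change to the model's embedding size.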
Thank you.
Could you please confirm whether this issue is resolved for you with the above comment? Please feel free to close the issue if it is resolved.
Thank you.
Resolved.