
Are there reserved/unused tokens for developers?

Open Qubitium opened this issue 1 year ago • 0 comments

Because a BPE vocabulary cannot be expanded dynamically after training, some BPE-tokenizer-based models such as Qwen reserve ~2k extra unused tokens at the end of the vocabulary for developers to use as they see fit during fine-tuning.

Does Gemma have a list of internally unused tokens?

Sometimes model makers resize the vocabulary to a GPU-friendly multiple, which creates unused tokens, or they intentionally leave some tokens unused, as Qwen does.
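One way to answer this empirically is to scan the tokenizer's vocabulary for tokens whose names look reserved. The sketch below is a minimal illustration, assuming the common naming patterns `<unused0>` (Gemma-style) and `<|extra_0|>` (Qwen-style); the toy vocabulary stands in for a real tokenizer's `get_vocab()` output, and the patterns are assumptions, not an official list.

```python
import re

def find_unused_tokens(vocab: dict) -> list:
    """Return vocab entries whose names suggest they are reserved/unused.

    Matches "<unusedN>" (Gemma-style) or "<|extra_N|>" (Qwen-style);
    both patterns are assumptions for illustration.
    """
    pattern = re.compile(r"<unused\d+>|<\|extra_\d+\|>")
    # Sort by token id so the result follows vocabulary order.
    return sorted(
        (tok for tok in vocab if pattern.fullmatch(tok)),
        key=vocab.get,
    )

# Toy vocabulary standing in for a real tokenizer's get_vocab() output.
toy_vocab = {"hello": 0, "<unused0>": 1, "world": 2, "<|extra_0|>": 3}
print(find_unused_tokens(toy_vocab))  # → ['<unused0>', '<|extra_0|>']
```

With a real tokenizer you would pass `tokenizer.get_vocab()` instead of the toy dict, then inspect which of the matched tokens actually never appear in training data before repurposing them.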

Qubitium avatar Feb 23 '24 05:02 Qubitium