Special token_ids in tokenizer

Open Happenmass opened this issue 1 year ago • 1 comment

I noticed that there are 100 `additional_special_tokens` in the `tokenizer_config.json` of the official repo on Hugging Face, but I could not find anywhere else that these special tokens are used. Could you please share any information about them?

```json
"32000": {
  "content": "<extra_id_99>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
"32001": {
  "content": "<extra_id_98>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
"32002": {
  "content": "<extra_id_97>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
```

Jul 02 '24 08:07 Happenmass

These tokens have no particular usage, but you could use them if you need to add tokens when you train the model.

Aug 01 '24 15:08 ylacombe
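
For anyone who wants to try the suggestion above, here is a minimal sketch of one way to repurpose the reserved tokens as custom markers in training text. It is plain Python and does not touch the tokenizer itself. The helper names (`reserved_token`, `reserved_token_id`, `assign_markers`, `apply_markers`) and the example markers are hypothetical, and the id formula assumes the pattern in the three quoted entries (32000 is `<extra_id_99>`, 32001 is `<extra_id_98>`, and so on) continues for all 100 tokens, which the thread does not confirm.

```python
# Hypothetical helpers, not part of the repo or of any library.
# ASSUMPTION: ids run from 32000 (<extra_id_99>) up to 32099 (<extra_id_0>),
# extrapolated from the three entries quoted in the issue.
BASE_ID = 32000
NUM_RESERVED = 100


def reserved_token(n: int) -> str:
    """Return the string form of the n-th reserved token."""
    if not 0 <= n < NUM_RESERVED:
        raise ValueError(f"n must be in [0, {NUM_RESERVED}), got {n}")
    return f"<extra_id_{n}>"


def reserved_token_id(n: int) -> int:
    """Return the id of the n-th reserved token under the assumed layout."""
    reserved_token(n)  # reuse the range check
    return BASE_ID + (NUM_RESERVED - 1 - n)


def assign_markers(markers: list[str]) -> dict[str, str]:
    """Map each custom marker to its own reserved token, in order."""
    if len(markers) > NUM_RESERVED:
        raise ValueError("more markers than reserved tokens")
    return {marker: reserved_token(i) for i, marker in enumerate(markers)}


def apply_markers(text: str, mapping: dict[str, str]) -> str:
    """Replace custom markers in the text with their reserved tokens."""
    for marker, token in mapping.items():
        text = text.replace(marker, token)
    return text


# Example markers are made up for illustration.
mapping = assign_markers(["[laugh]", "[pause]"])
prepared = apply_markers("Hello [pause] world [laugh]", mapping)
```

The prepared text can then go through the tokenizer as usual. Because the reserved tokens are already in the vocabulary, this approach would not change the vocabulary size or require resizing the embedding matrix. To confirm the real ids rather than relying on the assumed formula, `tokenizer.convert_tokens_to_ids("<extra_id_0>")` on the loaded tokenizer should report them.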