
num-start and num-end

Open GGchen1997 opened this issue 4 months ago • 3 comments

Thanks for the fantastic work you present!

https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct/blob/main/special_tokens_map.json

In the tokenizer, there are two special tokens, num-start and num-end. I want to ask whether these two tokens were used in pre-training so that the model can handle numbers specifically.

GGchen1997 avatar Aug 02 '25 12:08 GGchen1997

No, these two special tokens are not used during pre-training, so they do not have any effect.

Monohydroxides avatar Aug 05 '25 05:08 Monohydroxides

Thanks for your reply. Is there a suggested special token for handling numbers specifically? Were <|arithmetic_start|> and <|arithmetic_end|> used in pre-training? How are the "role" tokens used in pre-training?

GGchen1997 avatar Aug 05 '25 13:08 GGchen1997

Other special tokens for handling numbers, such as <|arithmetic_start|> and <|arithmetic_end|>, were also not used during pre-training. The "role" tokens in the tokenizer were not used in pre-training either.
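Since none of these tokens had any effect during pre-training, anyone who wants number-specific handling would have to introduce the delimiters themselves during fine-tuning. A minimal sketch of one way to prepare such data, wrapping numeric literals in start/end markers before tokenization (the `wrap_numbers` helper and the `<num-start>`/`<num-end>` token strings here are hypothetical, not part of the LLaDA tokenizer's actual usage):

```python
import re

def wrap_numbers(text, start="<num-start>", end="<num-end>"):
    """Wrap each integer or decimal literal in delimiter tokens.

    Fine-tuning on text preprocessed this way is one way to teach a
    model to treat numbers specially; the delimiter strings would need
    to be registered as special tokens in the tokenizer as well.
    """
    return re.sub(r"\d+(?:\.\d+)?", lambda m: f"{start}{m.group(0)}{end}", text)

print(wrap_numbers("Add 12 and 3.5"))
# -> Add <num-start>12<num-end> and <num-start>3.5<num-end>
```

Whether this actually improves numeric reasoning would depend on the fine-tuning setup; the sketch only shows the data-side preprocessing.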

Monohydroxides avatar Aug 06 '25 07:08 Monohydroxides