
Data Organization of Pretrain Data

Open lll2343 opened this issue 9 months ago • 1 comment

Thanks for your great work!

I wonder if the pretrain data also needs to be organized in the following format:
<BOS><start_id>user<end_id>\nWhat is the capital of France?<eot_id><start_id>assistant<end_id>\nParis.<EOS>....{until max token length}

In addition, do those special tokens need to be added like this?

token_list = ["<BOS>", "<start_id>", "<end_id>", "<eot_id>", "<EOS>"]
# `tokenizer` here is the Hugging Face tokenizer loaded for LLaDA
num_new_tokens = tokenizer.add_tokens(token_list, special_tokens=True)

This way, my prompt length would correspond to the one in guidance.md.


lll2343 · Mar 11 '25 07:03

The pretrain data and the SFT data follow different organization formats. Simply put, the pretrain data does not include user, assistant, or their associated special tokens. You may refer to the standard pretrain data formats used for autoregressive models.
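
For illustration only, here is a minimal sketch of the difference (assuming a Hugging Face tokenizer; the helper names and max_length are hypothetical and not part of the actual LLaDA pipeline):

# Pretrain sample: raw document text, tokenized and truncated/packed to the
# context length, with no user/assistant role markers.
def build_pretrain_sample(document_text, tokenizer, max_length=4096):
    ids = tokenizer(document_text, add_special_tokens=True)["input_ids"]
    return ids[:max_length]

# SFT sample: chat-style formatting with role markers, as in the template
# quoted in the question above.
def build_sft_sample(prompt, response, tokenizer, max_length=4096):
    text = ("<BOS><start_id>user<end_id>\n" + prompt + "<eot_id>"
            + "<start_id>assistant<end_id>\n" + response + "<EOS>")
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return ids[:max_length]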

For details on how special tokens are added, please consult the configuration at: https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct/blob/main/tokenizer_config.json#L2167.
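
As a quick check (a sketch assuming the standard transformers API), you can load the Instruct tokenizer and inspect which special tokens are already registered instead of adding them manually:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)
print(tokenizer.special_tokens_map)          # e.g. BOS/EOS tokens
print(tokenizer.additional_special_tokens)   # any extra role/turn markers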

Monohydroxides · Mar 12 '25 01:03