Data Organization of Pretrain Data
Thanks for your great work!
I wonder if the pretrain data also needs to be organized in the following format:
```
<BOS><start_id>user<end_id>\nWhat is the capital of France?<eot_id><start_id>assistant<end_id>\nParis.<EOS>....{until max token length}
```
In addition, do those special tokens need to be added like this?
```python
token_list = ["<BOS>", "<start_id>", "<end_id>", "<eot_id>", "<EOS>"]
num_new_tokens = tokenizer.add_tokens(token_list, special_tokens=True)
```
That way, my prompt length would correspond to the one described in guidance.md.
The pretraining data and the SFT data are organized differently. Simply put, the pretraining data is plain text: it does not include the user/assistant role markers or their associated special tokens. You can refer to the standard pretraining data formats used for autoregressive models.
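For illustration, here is a minimal sketch of what that usually looks like: raw documents are tokenized as plain text, concatenated (optionally separated by EOS), and chunked to the maximum sequence length. The tokenizer name and sequence length below are placeholders, not the exact LLaDA pretraining setup.

```python
from transformers import AutoTokenizer

# Illustrative tokenizer and sequence length; substitute your own setup.
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Base", trust_remote_code=True)
max_len = 4096

documents = [
    "Paris is the capital of France.",
    "The Seine flows through the city.",
]

# Tokenize each document as plain text; the only delimiter is an optional
# EOS between documents. No user/assistant tags, no chat special tokens.
eos_id = tokenizer.eos_token_id  # may be None for some tokenizers
ids = []
for doc in documents:
    ids += tokenizer(doc, add_special_tokens=False)["input_ids"]
    if eos_id is not None:
        ids.append(eos_id)

# Pack the concatenated token stream into fixed-length training sequences.
sequences = [ids[i:i + max_len] for i in range(0, len(ids), max_len)]
```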
For details on how special tokens are added, please consult the configuration at: https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct/blob/main/tokenizer_config.json#L2167.
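As a quick way to check this yourself (a minimal sketch, assuming the public GSAI-ML/LLaDA-8B-Instruct checkpoint on the Hugging Face Hub), you can load the released tokenizer and inspect the special tokens it already defines, rather than calling add_tokens manually:

```python
from transformers import AutoTokenizer

# The special tokens are already registered in the released tokenizer config,
# so there is no need to add them yourself with add_tokens().
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)

print(tokenizer.special_tokens_map)   # mapped special tokens (e.g. BOS/EOS)
print(tokenizer.all_special_tokens)   # full list of registered special tokens
print(len(tokenizer))                 # vocabulary size including special tokens
```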