ColossalAI
[coati] How to get prompt_path and pretrain_dataset?
Hi, I want to reproduce the training process, but I don't have the two datasets. Do you plan to open-source them? Thanks. https://github.com/hpcaitech/ColossalAI/blob/638a07a7f9b504e6c9781e9aa2a9b6c5e9dc49ed/applications/Chat/examples/train_prompts.py#L208-L209
Also, is pretrain_dataset the SFT dataset?
@gongel We have released it. Thanks. https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#supervised-datasets-collection
OK. So pretrain_dataset is the SFT dataset, and InstructionWild is an SFT dataset. Can we produce the prompt dataset from InstructionWild?
I have the same question. pretrain_dataset is the SFT dataset, but what dataset does prompt_path expect? Can it be the same as pretrain_dataset?
Is it possible to extract some of the instructions from pretrain_dataset (the SFT dataset) to generate the prompt dataset?
I think so, too
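For anyone who wants to try the idea above, here is a minimal sketch of sampling instructions from an SFT dataset to build a prompt-only dataset. It assumes the SFT records are dicts with an "instruction" key and that the prompt dataset only needs that field. Neither assumption is confirmed in this thread, and `extract_prompts` is a hypothetical helper, not part of coati.

```python
import random


def extract_prompts(sft_records, sample_size, seed=0):
    """Sample instructions from SFT records to build a prompt-only dataset.

    Assumption: each record is a dict with an "instruction" key.
    Records without a non-empty instruction are skipped.
    """
    rng = random.Random(seed)
    with_instruction = [r for r in sft_records if r.get("instruction")]
    # Never ask for more samples than there are usable records.
    k = min(sample_size, len(with_instruction))
    sampled = rng.sample(with_instruction, k)
    # Keep only the instruction; the SFT answer is not needed for prompts.
    return [{"instruction": r["instruction"]} for r in sampled]


if __name__ == "__main__":
    demo = [
        {"instruction": "Explain overfitting.", "output": "..."},
        {"instruction": "Translate 'hello' to French.", "output": "..."},
        {"output": "record without an instruction"},
    ]
    print(extract_prompts(demo, sample_size=2))
```

The result can be written out with `json.dump` and passed as prompt_path, provided the prompt dataset loader accepts that layout.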
In the __init__ function of the class SupervisedDataset (sft_dataset.py), one line of source code reads the "output" key after loading pretrain_dataset: targets = [f"{example['output']}{tokenizer.eos_token}" for example in list_data_dict]
This raises a KeyError when running train_prompts.py.
Could the page document the required column names for all the datasets you used?
Thanks.
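Until the column names are documented, a quick pre-check like the sketch below can report which records would trigger the KeyError before training starts. The "output" key comes from the line quoted above. "instruction" is an assumption about the expected schema, and `find_missing_keys` is a hypothetical helper, not part of coati.

```python
def find_missing_keys(records, required_keys=("instruction", "output")):
    """Return (index, missing_keys) pairs for records lacking required columns.

    Assumption: records is a list of dicts, e.g. loaded with json.load.
    The default required_keys are a guess; adjust them to the real schema.
    """
    problems = []
    for index, record in enumerate(records):
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((index, missing))
    return problems


if __name__ == "__main__":
    demo = [
        {"instruction": "Explain overfitting.", "output": "..."},
        {"instruction": "No answer column here."},
    ]
    for index, missing in find_missing_keys(demo):
        print(f"record {index} is missing: {missing}")
```

An empty result means every record has the checked keys.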
@gongel Have you already solved this problem?
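Hi there. I'd like to ask a question. When training on the SFT and prompt data, SupervisedDataset() rewrites the instruction, but no such rewriting happens when training the reward model, and inference.py does not rewrite it either. The inputs therefore differ somewhat between training stages. Won't that affect the model's output?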
@gongel We have released it. Thanks. https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#supervised-datasets-collection
Which file should be used?