[coati] How to get prompt_path and pretrain_dataset?

Open gongel opened this issue 1 year ago • 9 comments

Hi, I want to reproduce the training process, but I don't have the two datasets (prompt_path and pretrain_dataset). Do you plan to open-source them? Thanks. https://github.com/hpcaitech/ColossalAI/blob/638a07a7f9b504e6c9781e9aa2a9b6c5e9dc49ed/applications/Chat/examples/train_prompts.py#L208-L209

gongel • Apr 03 '23 11:04

Also, is pretrain_dataset the SFT dataset?

gongel • Apr 03 '23 12:04

@gongel We have released it. Thanks. https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#supervised-datasets-collection

binmakeswell • Apr 06 '23 09:04

OK. So pretrain_dataset is the SFT dataset, and InstructionWild is an SFT dataset. Can we produce the prompt dataset from InstructionWild?

gongel • Apr 06 '23 09:04

I have the same question. pretrain_dataset is the SFT dataset, but what dataset should prompt_path point to? Can it be the same as pretrain_dataset?

guijuzhejiang • Apr 07 '23 07:04

Is it possible to extract a subset of the instructions in pretrain_dataset (the SFT dataset) to generate the prompt dataset?

guijuzhejiang • Apr 07 '23 08:04

> Is it possible to extract a part of instructions in pretrain_dataset (sft dataset) to generate prompt_datasets?

I think so, too.

gongel • Apr 10 '23 02:04
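The idea discussed above, sampling instructions from the SFT data to build a prompt file, could be sketched roughly as follows. This is only an illustration, not the maintainers' procedure: the function name `build_prompt_records` is hypothetical, and it assumes each SFT record is a dict with an "instruction" key and that the prompt file needs only that field.

```python
import random
from typing import Dict, List


def build_prompt_records(
    sft_records: List[Dict[str, str]],
    num_prompts: int,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Sample instruction-only records from SFT-style records.

    Assumption: each record stores its prompt under the "instruction" key.
    """
    # Skip records whose instruction is missing or empty.
    candidates = [r for r in sft_records if r.get("instruction")]
    # A seeded generator keeps the sample reproducible across runs.
    rng = random.Random(seed)
    picked = rng.sample(candidates, min(num_prompts, len(candidates)))
    # Keep only the instruction; responses are not needed for prompts.
    return [{"instruction": r["instruction"]} for r in picked]
```

In practice the records would be read with `json.load` from the SFT file and the result written with `json.dump` to whatever path is passed as prompt_path. Whether the prompts should overlap with the SFT training examples is a separate choice this sketch does not address.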

> @gongel We have released it. Thanks. https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#supervised-datasets-collection

In the init function of the SupervisedDataset class (sft_dataset.py), one line reads the "output" key after loading pretrain_dataset:

targets = [f"{example['output']}{tokenizer.eos_token}" for example in list_data_dict]

This causes a KeyError when running train_prompts.py.

Could the page document the required column names for every dataset you used?

Thanks.

nctu6 • Apr 12 '23 05:04
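Until the required columns are documented, a quick check before training can show which records would trigger the KeyError described above. The sketch below is an assumption-laden illustration: `find_missing_keys` is a hypothetical helper, and the default key list ("instruction", "output") is a guess based on the line quoted from sft_dataset.py, not a confirmed schema.

```python
from typing import List, Mapping, Sequence, Tuple


def find_missing_keys(
    records: Sequence[Mapping[str, object]],
    required_keys: Sequence[str] = ("instruction", "output"),
) -> List[Tuple[int, List[str]]]:
    """Return (record index, missing keys) for every incomplete record.

    Assumption: the default required_keys is a guess, not a documented schema.
    """
    problems = []
    for index, record in enumerate(records):
        # Collect every required key the record does not define.
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((index, missing))
    return problems
```

An empty result means every record has the assumed keys; otherwise the returned indices point at the records to fix or drop before running train_prompts.py.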

@gongel Have you solved this problem yet?

XiaoLaoDi • Apr 21 '23 02:04

> Is it possible to extract a part of instructions in pretrain_dataset (sft dataset) to generate prompt_datasets?
>
> I think so, too

Hi, I'd like to ask a question. When we train the SFT and prompt stages, SupervisedDataset() rewrites the instruction, but no such rewriting is done when training reward_model or in inference.py. The inputs therefore differ between training stages. Doesn't this affect the model's output?

vincezengqiang • Apr 24 '23 02:04
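If the mismatch described above turns out to matter, one way to keep the stages consistent is to route every input through a single formatting function. The sketch below shows the idea only. The template strings are an assumption modeled on the Alpaca format and are not copied from sft_dataset.py, and `format_prompt` is a hypothetical helper, not part of coati.

```python
# Assumed Alpaca-style templates; check sft_dataset.py for the real wording.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_prompt(instruction: str, input_text: str = "") -> str:
    """Wrap a raw instruction in the same template at every stage."""
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)
```

Calling the same function when building reward-model inputs and in the inference script would give the model the format it saw during SFT. Whether the unformatted inputs actually degrade results is the open question here, and this sketch does not answer it.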

> @gongel We have released it. Thanks. https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#supervised-datasets-collection

Which file should be used?

SeekPoint • Jul 12 '23 09:07