Training model save problem

Open xcxhy opened this issue 2 years ago • 10 comments

Describe the feature

Hello, I used ChatGPT/examples/train_dummy.py to train GPT2. Training completed successfully, but I cannot find the trained model. Could you write a document explaining how to save the model and how to use the trained model?

xcxhy avatar Feb 16 '23 06:02 xcxhy

I ran into the same problem too.

terminator123 avatar Feb 16 '23 07:02 terminator123

Thanks for your feedback. Note that train_dummy.py uses random input, so you may not see any improvement after training. In any case, we'll add some guidance and a demo on saving the model and running inference with it.

ht-zhou avatar Feb 16 '23 07:02 ht-zhou

@ht-zhou I have read the source code, and it looks like train_prompts.py does not save a checkpoint either. Please add more demos to help us run it.

HuggingLLM avatar Feb 16 '23 09:02 HuggingLLM

@NLP-ZY Agreed. I noticed that neither train_dummy.py nor train_prompts.py saves a checkpoint.

xcxhy avatar Feb 16 '23 09:02 xcxhy

same

YeHaijia avatar Feb 16 '23 09:02 YeHaijia

same

lajiyuan avatar Feb 19 '23 03:02 lajiyuan

I created the directory first, as follows, and then the model was saved correctly: os.makedirs(os.path.dirname(args.save_path), exist_ok=True)
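A minimal sketch of that workaround, wrapped in a helper so it also copes with a bare filename. The name ensure_parent_dir is hypothetical and not part of the example scripts. The guard is needed because os.path.dirname returns an empty string for a path such as "model.pt", and os.makedirs("") raises FileNotFoundError.

```python
import os


def ensure_parent_dir(save_path: str) -> str:
    """Create the directory that will contain save_path, if it is missing.

    Hypothetical helper for illustration; returns save_path unchanged so it
    can be passed straight to whatever save call the script uses.
    """
    parent = os.path.dirname(save_path)
    # dirname is "" for a bare filename such as "model.pt"; os.makedirs("")
    # raises FileNotFoundError, so only create a directory when there is one.
    if parent:
        os.makedirs(parent, exist_ok=True)
    return save_path
```

Assuming the training script writes its checkpoint with torch.save, the call might look like torch.save(model.state_dict(), ensure_parent_dir(args.save_path)). Whether the example scripts save this way is an assumption here, not something confirmed in this thread.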

Syno8 avatar Feb 19 '23 09:02 Syno8

> I ran into the same problem too.

Hello, did you solve the problem?

A-biao96 avatar Feb 21 '23 07:02 A-biao96

This issue should be fixed by #2846 once it is merged.

FrankLeeeee avatar Feb 21 '23 08:02 FrankLeeeee

> I ran into the same problem too.
>
> Hello, did you solve the problem? This issue should be fixed by #2846 once it is merged.

Thanks, I got it.

A-biao96 avatar Feb 22 '23 00:02 A-biao96