
[BUG]: How can I save a checkpoint trained with GeminiDDP

Open upwindflys opened this issue 2 years ago • 4 comments

🐛 Describe the bug

When training the OPT model, I have no idea how to save the checkpoint. I tried this code:

model = GeminiDDP(model, device=get_current_device(), placement_policy=PLACEMENT_POLICY, pin_memory=True)
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(output_dir)

The file written to output_dir is very small, about 300 KB, while the model itself is 16 GB. It also cannot be loaded with the transformers API.

Environment

No response

upwindflys avatar Jan 13 '23 09:01 upwindflys

Add model = get_static_torch_model(model) before saving.

haofanwang avatar Jan 13 '23 11:01 haofanwang

@haofanwang Correct! get_static_torch_model will return a plain torch model, and you can do whatever you want with it.

@1SAA @FrankLeeeee we should add a chapter in our document on model saving and loading.

feifeibear avatar Jan 14 '23 03:01 feifeibear

Thank you for the reply. But I'm not sure whether model_to_save = model.module if hasattr(model, 'module') else model is equivalent to get_static_torch_model? The checkpoint saved this way is only 170 KB. @feifeibear @haofanwang

upwindflys avatar Jan 18 '23 05:01 upwindflys

No. Import it like this:

from colossalai.nn.parallel.utils import get_static_torch_model

and add model = get_static_torch_model(model) before saving.

JThh avatar Feb 02 '23 05:02 JThh
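To see why saving model.module produces such a small file, here is a stdlib-only mock of the situation (FakeGeminiWrapper and gather_static_model are hypothetical stand-ins, not ColossalAI APIs): a Gemini-style wrapper's state dict holds only lightweight placeholders because the real weights are sharded/offloaded elsewhere, whereas gathering a static model materializes the full parameters, so only the gathered model serializes to a realistically sized checkpoint.

```python
import io
import pickle

# Full parameter data, standing in for a 16 GB model's weights.
FULL_PARAMS = {f"layer{i}.weight": [0.0] * 10_000 for i in range(8)}

class FakeGeminiWrapper:
    """Mimics a Gemini-style wrapper: its state_dict carries only metadata,
    because the actual parameter data lives in sharded/offloaded storage."""
    def state_dict(self):
        return {name: {"shape": (len(vals),)} for name, vals in FULL_PARAMS.items()}

def gather_static_model(wrapper):
    """Mimics get_static_torch_model: materialize the full parameters again."""
    return FULL_PARAMS

def pickled_size(obj):
    # Serialized size in bytes, as a proxy for checkpoint file size.
    buf = io.BytesIO()
    pickle.dump(buf and obj, buf)
    return buf.getbuffer().nbytes

wrapper = FakeGeminiWrapper()
small = pickled_size(wrapper.state_dict())          # like the ~300 KB file
full = pickled_size(gather_static_model(wrapper))   # like the real model

print(f"placeholder checkpoint: {small} bytes, gathered model: {full} bytes")
```

Running this shows the placeholder checkpoint is orders of magnitude smaller than the gathered one, which matches the 300 KB vs 16 GB symptom: saving the wrapper (or model.module) serializes bookkeeping, not weights.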

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 18 '23 08:04 binmakeswell