ColossalAI
[BUG]: How can I save a checkpoint for a model trained with GeminiDDP?
🐛 Describe the bug
When running the OPT model, I don't know how to save the checkpoint. I tried the following code:
model = GeminiDDP(model, device=get_current_device(), placement_policy=PLACEMENT_POLICY, pin_memory=True)
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(output_dir)
The file in output_dir is tiny, about 300 KB, while the model itself is 16 GB. It also cannot be loaded with the transformers API.
Environment
No response
Add model = get_static_torch_model(model) before saving.
@haofanwang Correct! get_static_torch_model returns a plain torch model, so you can do whatever you want with it.
@1SAA @FrankLeeeee we should add a chapter on model saving and loading to our documentation.
Thank you for the reply.
But I'm not sure: is model_to_save = model.module if hasattr(model, 'module') else model equivalent to get_static_torch_model? The checkpoint produced this way is only 170 KB. @feifeibear @haofanwang
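The two are not equivalent. The sketch below illustrates the difference with stand-in classes (they are not the real colossalai API): a Gemini-style wrapper shards parameter data away from the wrapped module, so naively unwrapping `.module` saves only empty placeholders, while a gather step (what `get_static_torch_model` does) materialises the full weights first.

```python
class _ShardedParam:
    """Stand-in for a parameter whose real data lives in sharded storage."""
    def __init__(self, numel):
        self.numel = numel
        self.data = []  # placeholder: the actual values are sharded elsewhere

class _GeminiLikeWrapper:
    """Stand-in for a GeminiDDP-style wrapper that shards its module's params."""
    def __init__(self, params):
        # the wrapped module only sees placeholder parameters
        self.module = {name: _ShardedParam(n) for name, n in params.items()}
        # the real values are kept in an external (sharded) store
        self._shards = {name: list(range(n)) for name, n in params.items()}

    def gather_static(self):
        """Analogue of get_static_torch_model: gather full parameter values."""
        return {name: list(self._shards[name]) for name in self.module}

wrapped = _GeminiLikeWrapper({"w": 4, "b": 2})

# Naive unwrap: looks like a model, but every tensor is an empty placeholder,
# so a checkpoint written from it is tiny (the ~170 KB symptom above).
naive = wrapped.module
naive_sizes = {k: len(v.data) for k, v in naive.items()}

# Gathered version: the full weights are available again and can be saved.
static = wrapped.gather_static()
static_sizes = {k: len(v) for k, v in static.items()}
```

This is only a toy model of the behaviour; the real mechanism in colossalai involves chunk-based tensor management, but the saving implication is the same: gather before you save.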
No. Import the function below:
from colossalai.nn.parallel.utils import get_static_torch_model
and add model = get_static_torch_model(model) before saving.
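Putting the advice together, the save path can be sketched as a small helper. The gather function is passed in as a parameter so the sketch stays importable without colossalai; in real use it would be `get_static_torch_model` from `colossalai.nn.parallel.utils`, and the saved directory should then load with the usual transformers `from_pretrained`.

```python
def save_gemini_checkpoint(model, output_dir, gather_fn):
    """Gather sharded GeminiDDP parameters into a static torch model,
    then write a checkpoint loadable by the transformers API.

    gather_fn: in practice, colossalai's get_static_torch_model.
    """
    static_model = gather_fn(model)          # materialise full weights
    static_model.save_pretrained(output_dir)  # standard HF-style save
    return static_model
```

With colossalai installed, this would be called as `save_gemini_checkpoint(model, output_dir, get_static_torch_model)`.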
We have made many updates since then. This issue was closed due to inactivity. Thanks.