
[BUG] ZeRO stage 3 cannot save model weights correctly!

Open tarrett opened this issue 2 years ago • 8 comments

Describe the bug
Training the llama-7b model with ZeRO stage 3 and stage3_gather_16bit_weights_on_model_save set to true in ds_config.json, the saved pytorch_model.bin is only 610K. Strangely, the model saved in the checkpoints during training is normal.

The deepspeed version is 0.9.6
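
For reference, the relevant part of a ds_config.json matching this description would look roughly like the sketch below (only the stage and the gather-on-save flag come from the report; the other values are illustrative assumptions):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": true }
}
```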

tarrett avatar Jun 29 '23 03:06 tarrett

Hi, have you managed to solve it? I am also facing the same issue. Thanks

ajinkya123-robo avatar Jul 01 '23 19:07 ajinkya123-robo

@tarrett, thanks for reporting this. Can you please share repro steps?

tjruwase avatar Jul 03 '23 13:07 tjruwase

Same issue: with DeepSpeed ZeRO stage 3 + Transformers Trainer, we can't correctly save the final model weights after training with trainer.save_model(). The checkpoints saved during training, however, are fine.

ZubinGou avatar Jul 04 '23 12:07 ZubinGou

@ZubinGou, can you please share details to help us repro? Thanks!

tjruwase avatar Jul 05 '23 20:07 tjruwase

Sure. Simply use the official ZeRO stage 3 config with stage3_gather_16bit_weights_on_model_save set to true, following this. Then use the Hugging Face Trainer to train GPT-2 or LLaMA (or any model) with trainer.train() and save the model with trainer.save_model(); you will find the saved weights are still incomplete.

You can use any of the following repositories to reproduce this issue:

  • https://github.com/OptimalScale/LMFlow
  • https://github.com/AetherCortex/Llama-X
  • https://github.com/lm-sys/FastChat
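
A minimal sketch of those repro steps (the model, toy dataset, and paths below are placeholders rather than details from the reports above; ds_config.json is assumed to be a ZeRO stage 3 config with stage3_gather_16bit_weights_on_model_save enabled, as in the earlier comment):

```python
# Launch with e.g.: deepspeed repro.py
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

class ToyDataset(Dataset):
    """A few identical short sequences, just enough to make Trainer.train() run."""
    def __init__(self, tokenizer):
        enc = tokenizer(["hello world"] * 8, return_tensors="pt", padding=True)
        self.input_ids = enc["input_ids"]
    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self, i):
        return {"input_ids": self.input_ids[i], "labels": self.input_ids[i]}

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    deepspeed="ds_config.json",  # ZeRO-3 config with the 16-bit gather-on-save flag enabled
)

trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer))
trainer.train()
# Reported behavior in this issue: under ZeRO-3 this writes an incomplete pytorch_model.bin
trainer.save_model("./final_model")
```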

ZubinGou avatar Jul 06 '23 02:07 ZubinGou

Hello, have you solved this?

Mr-lonely0 avatar Jun 19 '24 01:06 Mr-lonely0

Also encountered this error. The saved checkpoint is quite small and not at all usable.

BiEchi avatar Apr 03 '25 18:04 BiEchi

I encountered the same issue. I found that the following two approaches worked for me.

First, set

ds_config = { "zero_optimization": { "stage": 3, "stage3_gather_16bit_weights_on_model_save": True, "offload_param": {"device": "none"}, "offload_optimizer": {"device": "none"} }, "bf16": {"enabled": True}, "gradient_accumulation_steps": 1 } ds_plugin = DeepSpeedPlugin( hf_ds_config=ds_config, zero3_save_16bit_model=True )

Then, approach 1):

```python
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
save_path = f"/home/jovyan/workspace/fang375/RLHF_test/model_parameters_set/{model_name}"
unwrapped_model.save_pretrained(
    save_path,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)
```

This works for a 1.5B model but still fails on 3B and some larger models.

And approach 2):

```python
import deepspeed

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
save_path = f"./model_parameters_set/{model_name}"
# Gather the ZeRO-3 partitioned parameters so the main process sees the full weights
with deepspeed.zero.GatheredParameters(list(unwrapped_model.parameters()), modifier_rank=0):
    if accelerator.is_main_process:
        unwrapped_model.save_pretrained(save_path)
```

The second approach works for the 3B model.
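
(For context: deepspeed.zero.GatheredParameters temporarily all-gathers the listed ZeRO-3 partitioned parameters inside the with block, so save_pretrained on the main process sees the full weights rather than the empty placeholder shards; that is presumably why this path succeeds where the plain state-dict save did not.)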

wenzhifang avatar May 30 '25 02:05 wenzhifang