[BUG] ZeRO Stage 3 cannot save model weights correctly!
Describe the bug
Training the llama-7b model with ZeRO Stage 3 and stage3_gather_16bit_weights_on_model_save set to true in ds_config.json, the saved pytorch_model.bin is only 610K. Strangely, the model saved in the checkpoints during training is normal.
The deepspeed version is 0.9.6
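For reference, the relevant part of a ds_config.json that triggers this looks roughly as follows (a minimal sketch based on the standard HF/DeepSpeed ZeRO-3 example, not the exact config from this report):

```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```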
Hi, have you managed to solve it? I am also facing the same issue. Thanks
@tarrett, thanks for reporting this. Can you please share repro steps?
Same issue: with DeepSpeed ZeRO Stage 3 + Transformers Trainer, we can't correctly save the final model weights after training with trainer.save_model(). However, the checkpoints saved during training are fine.
@ZubinGou, can you please share details to help us repro? Thanks!
Sure. Simply use the official ZeRO Stage 3 config and set stage3_gather_16bit_weights_on_model_save to true, following this. Then use the Hugging Face Trainer to train GPT-2 or LLaMA (or any model) with trainer.train() and save it with trainer.save_model(); you will find the saved weights are still incomplete. (A minimal standalone sketch is also included after the repository list below.)
You can use any of the following repositories to reproduce this issue:
- https://github.com/OptimalScale/LMFlow
- https://github.com/AetherCortex/Llama-X
- https://github.com/lm-sys/FastChat
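For a standalone reproduction along the same lines, here is a minimal sketch (the model name, toy dataset, and output paths are placeholders rather than the exact setup used in this thread; launch it with the deepspeed launcher across multiple GPUs):

```python
# Minimal repro sketch: train a small causal LM under ZeRO-3 with the HF Trainer,
# then call trainer.save_model() and inspect the size of the saved pytorch_model.bin.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "gpt2"  # placeholder; LLaMA models reportedly show the same behavior
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny toy dataset so the script is self-contained; all samples have equal length,
# so the default data collator works without padding.
def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=32)
    out["labels"] = out["input_ids"].copy()
    return out

train_ds = Dataset.from_dict({"text": ["hello world"] * 64}).map(
    tokenize, batched=True, remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    bf16=True,
    save_strategy="no",          # keep the repro focused on the final save
    deepspeed="ds_config.json",  # ZeRO-3 config with stage3_gather_16bit_weights_on_model_save: true
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()
trainer.save_model("./final_model")  # expected: full weights; observed: pytorch_model.bin of only a few hundred KB
```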
Hello, have you solved this?
Also encountered this error. The saved checkpoint is quite small and not at all usable.
I encountered the same issue. I found that the following two approaches work for me.
First, set
ds_config = { "zero_optimization": { "stage": 3, "stage3_gather_16bit_weights_on_model_save": True, "offload_param": {"device": "none"}, "offload_optimizer": {"device": "none"} }, "bf16": {"enabled": True}, "gradient_accumulation_steps": 1 } ds_plugin = DeepSpeedPlugin( hf_ds_config=ds_config, zero3_save_16bit_model=True )
Then 1)

```python
# Assumes accelerator = Accelerator(deepspeed_plugin=ds_plugin) and that the model
# has already been prepared with accelerator.prepare(...).
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
save_path = f"/home/jovyan/workspace/fang375/RLHF_test/model_parameters_set/{model_name}"
unwrapped_model.save_pretrained(
    save_path,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),  # gathers the full ZeRO-3 state dict
)
```
This works for a 1.5B model but still fails on 3B and larger models.
2)

```python
import deepspeed

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
save_path = f"./model_parameters_set/{model_name}"
# Gather the ZeRO-3 partitioned parameters onto rank 0 before saving.
with deepspeed.zero.GatheredParameters(list(unwrapped_model.parameters()), modifier_rank=0):
    if accelerator.is_main_process:
        unwrapped_model.save_pretrained(save_path)
```
The second one works for the 3B model, presumably because GatheredParameters with modifier_rank=0 materializes the full, unpartitioned parameters on rank 0 before save_pretrained writes them.