DeepSpeedExamples
When training GPT-2 with ZeRO-3, some parameters are missing from the saved model
I replaced the model in steps 1 and 2 with a GPT-2 model, IDEA-CCNL/Wenzhong-GPT2-110M, and then trained with ZeRO-3 using the following command:
python train.py --actor-zero-stage 3 --actor-model 110m --reward-zero-stage 3 --reward-model 110m --deployment-type single_node
But in step 3 I hit a Missing key(s) error.
The left side shows the actual parameters of the reward model; the right side shows the error reported when running step 3, with some parameters missing.
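To see exactly which entries are absent, one can diff the checkpoint's keys against the keys the model expects. This is a small illustrative helper (`missing_keys` is my own name, not part of DeepSpeed or transformers):

```python
import torch


def missing_keys(model, checkpoint_path):
    """Return the model state-dict keys that are absent from a saved checkpoint."""
    saved = torch.load(checkpoint_path, map_location="cpu")
    expected = set(model.state_dict().keys())
    return sorted(expected - set(saved.keys()))
```

Running it on the reward-model checkpoint should print the same key names as the Missing key(s) error.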
We eventually traced it to how the model is saved under ZeRO-3.
def save_zero_three_model(model_ema, global_rank, save_dir, zero_stage=0):
    zero_stage_3 = (zero_stage == 3)
    os.makedirs(save_dir, exist_ok=True)
    WEIGHTS_NAME = "pytorch_model.bin"
    output_model_file = os.path.join(save_dir, WEIGHTS_NAME)
    model_to_save = model_ema.module if hasattr(model_ema, 'module') else model_ema
    if not zero_stage_3:
        if global_rank == 0:
            torch.save(model_to_save.state_dict(), output_model_file)
    else:
        output_state_dict = {}
        # for k, v in model_to_save.state_dict().items():
        for k, v in model_to_save.named_parameters():
            if hasattr(v, 'ds_id'):
                with deepspeed.zero.GatheredParameters(_z3_params_to_fetch([v]), enabled=zero_stage_3):
                    v_p = v.data.cpu()
            else:
                v_p = v.cpu()
            if global_rank == 0 and "lora" not in k:
                print(f"key: {k}")
                output_state_dict[k] = v_p
        if global_rank == 0:
            torch.save(output_state_dict, output_model_file)
        del output_state_dict
If you traverse the model with model_to_save.named_parameters(), you normally get only the learnable parameters that the optimizer updates; registered buffers are skipped. With this save path, two entries are missing from each self-attention layer.
k: transformer.wte.weight
k: transformer.wpe.weight
k: transformer.h.0.ln_1.weight
k: transformer.h.0.ln_1.bias
k: transformer.h.0.attn.c_attn.weight
k: transformer.h.0.attn.c_attn.bias
k: transformer.h.0.attn.c_proj.weight
k: transformer.h.0.attn.c_proj.bias
k: transformer.h.0.ln_2.weight
k: transformer.h.0.ln_2.bias
k: transformer.h.0.mlp.c_fc.weight
k: transformer.h.0.mlp.c_fc.bias
k: transformer.h.0.mlp.c_proj.weight
k: transformer.h.0.mlp.c_proj.bias
...
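One possible fix, sketched under my assumptions: also iterate named_buffers(), since named_parameters() does not yield registered buffers. (In the Hugging Face GPT-2 implementation, the attention module registers its causal-mask `bias` and `masked_bias` as buffers, which would explain exactly two missing entries per self-attention layer.) Buffers are not partitioned by ZeRO-3, so they can be copied without gathering; I pass `[v]` directly instead of the repo's `_z3_params_to_fetch` helper to keep the sketch self-contained:

```python
import torch


def full_state_dict(model_to_save, zero_stage_3=False):
    """Collect parameters *and* buffers into a plain CPU state dict.

    Sketch only: the ds_id branch assumes a model trained under DeepSpeed
    ZeRO-3, where each parameter must be gathered before it can be read.
    """
    output_state_dict = {}
    for k, v in model_to_save.named_parameters():
        if hasattr(v, 'ds_id'):  # ZeRO-3 partitioned parameter
            import deepspeed  # only needed in the ZeRO-3 case
            with deepspeed.zero.GatheredParameters([v], enabled=zero_stage_3):
                v_p = v.data.cpu()
        else:
            v_p = v.cpu()
        output_state_dict[k] = v_p
    # named_parameters() skips registered buffers (e.g. GPT-2's causal-mask
    # "bias"); add them explicitly. ZeRO-3 does not partition buffers.
    for k, v in model_to_save.named_buffers():
        output_state_dict[k] = v.cpu()
    return output_state_dict
```

With this, the saved dict should contain the same keys as model_to_save.state_dict(), so load_state_dict in step 3 should no longer report missing keys.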
But if you traverse with model_to_save.state_dict().items(), under normal circumstances all entries should be saved, both learnable parameters and non-learnable buffers. However, I found that some parameters do not seem to be gathered, and their values are empty.
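The empty values are expected under ZeRO-3: each rank holds only a shard, and the tensor that state_dict() returns is a size-0 placeholder until the parameter is gathered. A hedged sketch of the alternative fix, gathering all parameters on rank 0 first and then taking the full state_dict() (which also includes buffers); the function name and arguments are my own, and with `zero_stage_3=False` it degrades to a plain torch.save:

```python
import contextlib
import torch


def save_full_state_dict(model, path, global_rank=0, zero_stage_3=True):
    """Gather ZeRO-3 shards on rank 0, then save the complete state_dict."""
    if zero_stage_3:
        import deepspeed  # only required in the ZeRO-3 case
        # Gather every partitioned parameter onto rank 0 for the duration
        # of the context, so state_dict() sees the full tensors.
        ctx = deepspeed.zero.GatheredParameters(list(model.parameters()),
                                                modifier_rank=0)
    else:
        ctx = contextlib.nullcontext()
    with ctx:
        if global_rank == 0:
            full_sd = {k: v.cpu().clone() for k, v in model.state_dict().items()}
            torch.save(full_sd, path)
```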

How can I solve this problem? Please help.