
How to convert parallel state_dict to normal state_dict?

Open JinchaoLove opened this issue 9 months ago • 3 comments

Hi there! I saved a parallel state_dict (trainable parameters only, i.e. `requires_grad=True`) across 8 GPUs on a remote machine. How can I load these state_dicts and save them as a single one locally? Thanks in advance.

collie_dp0_pp0_tp0.pt  collie_zero_dp0_pp0_tp0.pt  collie_zero_dp2_pp0_tp0.pt  collie_zero_dp4_pp0_tp0.pt  collie_zero_dp6_pp0_tp0.pt
collie.json            collie_zero_dp1_pp0_tp0.pt  collie_zero_dp3_pp0_tp0.pt  collie_zero_dp5_pp0_tp0.pt  collie_zero_dp7_pp0_tp0.pt
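Merging per-rank shards like these generally means loading each file on one machine and taking a key-wise union. A minimal sketch of that idea, with plain dicts standing in for the results of `torch.load(path, map_location="cpu")` (note this is an assumption for illustration; ZeRO optimizer shards in particular usually need the framework's own consolidation logic rather than a naive union):

```python
# Toy stand-ins for the per-rank state_dicts; in practice each would come
# from torch.load("collie_zero_dp{i}_pp0_tp0.pt", map_location="cpu").
shards = [
    {"layer.0.weight": "slice-from-dp0"},
    {"layer.1.weight": "slice-from-dp1"},
]

merged = {}
for shard in shards:
    # Sanity check: a clean partition means no key appears in two shards.
    overlap = merged.keys() & shard.keys()
    assert not overlap, f"duplicate keys across shards: {overlap}"
    merged.update(shard)
```

The merged dict could then be written out once with `torch.save(merged, "pytorch_model.bin")`.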

JinchaoLove avatar Sep 18 '23 13:09 JinchaoLove

Hi, the model weights should be saved in files like pytorch_model.bin by the CheckpointCallback below.

callbacks = [CheckpointCallback(your_path, every_n_batches=1600, model_only=False, peft_only=False)]

BTW, are you using the main branch or the dev branch? We recommend using dev now.

KaiLv69 avatar Sep 18 '23 14:09 KaiLv69

Got it. I'm using the dev branch. So the files above are all trainer state (not model weights), as defined in the Trainer. The issue was caused by my filtering on `requires_grad`, which is always False for tensors returned by `state_dict()`.

self.checkpoint_file = "collie_dp{}_pp{}_tp{}.pt".format(env.dp_rank, env.pp_rank, env.tp_rank)  # Trainer state
state_dict = {n: p.detach().cpu() for n, p in model.state_dict().items() if p.requires_grad}  # always empty: state_dict() returns detached tensors
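The pitfall here is that `state_dict()` returns detached tensor copies whose `requires_grad` is always False; the trainable flag lives on the live parameters, so the filter has to consult `model.named_parameters()` instead. A torch-free sketch of the same behavior, using toy stand-ins for parameters (since only the detach semantics matter):

```python
class Param:
    """Toy stand-in for a torch.nn.Parameter."""
    def __init__(self, name, requires_grad):
        self.name = name
        self.requires_grad = requires_grad

    def detach(self):
        # Mimics Tensor.detach(): the detached copy never requires grad.
        return Param(self.name, requires_grad=False)

params = {
    "lora_A": Param("lora_A", requires_grad=True),       # trainable
    "base.weight": Param("base.weight", requires_grad=False),  # frozen
}

# state_dict()-style view: every entry is detached, so this filter is always empty.
state_dict = {n: p.detach() for n, p in params.items()}
empty = {n: p for n, p in state_dict.items() if p.requires_grad}

# Correct: check requires_grad on the live parameters, then detach for saving.
trainable = {n: p.detach() for n, p in params.items() if p.requires_grad}
```

With a real model, the fix is the same shape: `{n: p.detach().cpu() for n, p in model.named_parameters() if p.requires_grad}`.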

JinchaoLove avatar Sep 18 '23 14:09 JinchaoLove

The topk argument of CheckpointCallback defaults to 0, which means the model is never saved at all... I think it should default to 1 or -1, or raise a warning, to guard against misconfiguration.
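As a workaround until the default changes, topk can be set explicitly when constructing the callback. A config fragment in the style of the snippet above (assuming, as the comment suggests, that topk=-1 keeps all checkpoints and positive values keep that many most recent ones):

```python
callbacks = [CheckpointCallback(your_path, every_n_batches=1600, topk=-1,
                                model_only=False, peft_only=False)]
```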

JinchaoLove avatar Sep 19 '23 03:09 JinchaoLove