
How to convert parallel state_dict to normal state_dict?

Open JinchaoLove opened this issue 9 months ago • 3 comments

Hi there! I saved a parallel state_dict (trainable parameters only, i.e. `requires_grad=True`) across 8 GPUs on a remote machine. How can I load these state_dicts and save them as a single one locally? Thanks in advance.

collie_dp0_pp0_tp0.pt  collie_zero_dp0_pp0_tp0.pt  collie_zero_dp2_pp0_tp0.pt  collie_zero_dp4_pp0_tp0.pt  collie_zero_dp6_pp0_tp0.pt
collie.json            collie_zero_dp1_pp0_tp0.pt  collie_zero_dp3_pp0_tp0.pt  collie_zero_dp5_pp0_tp0.pt  collie_zero_dp7_pp0_tp0.pt
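Merging per-rank shards like these generally means loading each file on one machine and taking a key-wise union. A minimal sketch of that idea, with plain dicts standing in for the results of `torch.load(path, map_location="cpu")` (note this is an assumption for illustration; ZeRO optimizer shards in particular usually need the framework's own consolidation logic rather than a naive union):

```python
# Toy stand-ins for the per-rank state_dicts; in practice each would come
# from torch.load("collie_zero_dp{i}_pp0_tp0.pt", map_location="cpu").
shards = [
    {"layer.0.weight": "slice-from-dp0"},
    {"layer.1.weight": "slice-from-dp1"},
]

merged = {}
for shard in shards:
    # Sanity check: a clean partition means no key appears in two shards.
    overlap = merged.keys() & shard.keys()
    assert not overlap, f"duplicate keys across shards: {overlap}"
    merged.update(shard)
```

The merged dict could then be written out once with `torch.save(merged, "pytorch_model.bin")`.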

JinchaoLove avatar Sep 18 '23 13:09 JinchaoLove

Hi, the model weights should be saved in files like pytorch_model.bin by the CheckpointCallback below.

callbacks = [CheckpointCallback(your_path, every_n_batches=1600, model_only=False, peft_only=False)]

BTW, are you using the main branch or the dev branch? We recommend using dev now.

KaiLv69 avatar Sep 18 '23 14:09 KaiLv69

Got it. I'm using the dev branch. So the files above are all trainer state (not model weights), as defined in the Trainer. The issue was caused by my filtering on `requires_grad`, which is always False for tensors returned by `state_dict()`.

self.checkpoint_file = "collie_dp{}_pp{}_tp{}.pt".format(env.dp_rank, env.pp_rank, env.tp_rank)  # Trainer state
state_dict = {n: p.detach().cpu() for n, p in model.state_dict().items() if p.requires_grad}  # always empty: state_dict() returns detached tensors
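The pitfall here is that `state_dict()` returns detached tensor copies whose `requires_grad` is always False; the trainable flag lives on the live parameters, so the filter has to consult `model.named_parameters()` instead. A torch-free sketch of the same behavior, using toy stand-ins for parameters (since only the detach semantics matter):

```python
class Param:
    """Toy stand-in for a torch.nn.Parameter."""
    def __init__(self, name, requires_grad):
        self.name = name
        self.requires_grad = requires_grad

    def detach(self):
        # Mimics Tensor.detach(): the detached copy never requires grad.
        return Param(self.name, requires_grad=False)

params = {
    "lora_A": Param("lora_A", requires_grad=True),       # trainable
    "base.weight": Param("base.weight", requires_grad=False),  # frozen
}

# state_dict()-style view: every entry is detached, so this filter is always empty.
state_dict = {n: p.detach() for n, p in params.items()}
empty = {n: p for n, p in state_dict.items() if p.requires_grad}

# Correct: check requires_grad on the live parameters, then detach for saving.
trainable = {n: p.detach() for n, p in params.items() if p.requires_grad}
```

With a real model, the fix is the same shape: `{n: p.detach().cpu() for n, p in model.named_parameters() if p.requires_grad}`.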

JinchaoLove avatar Sep 18 '23 14:09 JinchaoLove

The topk argument of CheckpointCallback defaults to 0, which means the model is never saved at all... I think it should default to 1 or -1, or raise a warning, to guard against misconfiguration.
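As a workaround until the default changes, topk can be set explicitly when constructing the callback. A config fragment in the style of the snippet above (assuming, as the comment suggests, that topk=-1 keeps all checkpoints and positive values keep that many most recent ones):

```python
callbacks = [CheckpointCallback(your_path, every_n_batches=1600, topk=-1,
                                model_only=False, peft_only=False)]
```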

JinchaoLove avatar Sep 19 '23 03:09 JinchaoLove