FastChat [WIP] Fix FSDP saving error

Open zhisbug opened this issue 1 year ago • 2 comments

#476

Apr 25 '23 08:04 zhisbug

pending test by @ZYHowell

Apr 25 '23 08:04 zhisbug

@ZYHowell @zhisbug Any updates, or should we close this?

May 08 '23 06:05 merrymercy

@merrymercy I can help with the test, since I had the same problem before. I will post the results later.

Update:

I tried this PR on 4*A100 (80G): training is OK, but saving the model runs out of memory (OOM).

I might dig into this later.

May 13 '23 17:05 alanxmay

@merrymercy @zhisbug I tried several different settings using the FSDP API; all of them failed when saving the model.

But based on this comment, I finally managed to save the model with python3.10 and torch==2.0 by changing line 309 of /python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py from state_dict[fqn] = state_dict[fqn].clone().detach() to state_dict[fqn] = state_dict[fqn].cpu().clone().detach()
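
To make the effect of that one-line change concrete, here is a minimal sketch of the same expression applied to a plain dictionary of tensors. The helper name clone_state_dict_on_cpu is hypothetical and not part of torch or FastChat. The point it illustrates: calling .cpu() before .clone() means the copy is allocated in host memory rather than on the GPU. It does not replace the patch itself, because the original clone happens inside FSDP's own state-dict code.

```python
from typing import Any, Dict, Mapping


def clone_state_dict_on_cpu(state_dict: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the patched expression
    `state_dict[fqn].cpu().clone().detach()`.

    Tensors are duck-typed here (any object with .cpu(), .clone() and
    .detach(), such as torch.Tensor), so the sketch needs no torch import.
    """
    cloned: Dict[str, Any] = {}
    for fqn, tensor in state_dict.items():
        # Move to host memory first, so the clone is allocated in CPU RAM
        # instead of taking additional GPU memory.
        cloned[fqn] = tensor.cpu().clone().detach()
    return cloned
```

With real torch tensors this would be called as clone_state_dict_on_cpu(some_state_dict), where some_state_dict is an assumed, already materialized mapping from parameter names to tensors.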

Test machine: 4*A100(80G).

May 15 '23 05:05 alanxmay

@alanxmay This is just a workaround, although most of our users have indeed used it.

May 15 '23 10:05 zhisbug

Closing this PR; I am going to start a new PR with the fix.

May 15 '23 10:05 zhisbug
