FastChat

[WIP] Fix FSDP saving error

Open zhisbug opened this issue 1 year ago • 2 comments

#476

zhisbug avatar Apr 25 '23 08:04 zhisbug

pending test by @ZYHowell

zhisbug avatar Apr 25 '23 08:04 zhisbug

@ZYHowell @zhisbug Any updates, or should we close this?

merrymercy avatar May 08 '23 06:05 merrymercy

@merrymercy I can help with the test, since I had the same problem before. I will post the results later.


Update:

I tried this PR on 4*A100(80G): training is OK, but it runs out of memory (OOM) when saving.

I might dig into this later.

alanxmay avatar May 13 '23 17:05 alanxmay

@merrymercy @zhisbug I tried several different settings using the FSDP API; all of them failed when saving the model.

But based on this comment, I finally managed to save the model with python3.10 and torch==2.0 by changing line 309 of /python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py from state_dict[fqn] = state_dict[fqn].clone().detach() to state_dict[fqn] = state_dict[fqn].cpu().clone().detach(). The added .cpu() call moves each tensor to host memory before it is cloned, so the copy is allocated on the CPU rather than on the GPU.

Test machine: 4*A100(80G).
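
To make the order of operations in that one-line change concrete, here is a minimal sketch. It is not FSDP's internal code: the helper name clone_state_dict_on_cpu is hypothetical, and the function is duck-typed, so it only assumes each value exposes the cpu(), clone() and detach() methods that torch.Tensor provides. Calling it on an already-gathered state dict is not a substitute for the patch above, since in FSDP the clone happens inside the library while the state dict is being built.

```python
def clone_state_dict_on_cpu(state_dict):
    """Return a new dict whose values are CPU-side clones of the inputs.

    Hypothetical helper for illustration only. Each value is expected to
    behave like a torch.Tensor, i.e. to offer cpu(), clone() and detach().
    """
    return {
        # Move to host memory first, then clone: the copy is allocated on
        # the CPU, so no second full-size tensor is created on the GPU.
        fqn: tensor.cpu().clone().detach()
        for fqn, tensor in state_dict.items()
    }
```

The point of the ordering is that clone() allocates its result on whatever device the tensor currently lives on, so calling cpu() first keeps the extra copy off the GPU, which is where the OOM was reported.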

alanxmay avatar May 15 '23 05:05 alanxmay

@alanxmay This is just a workaround, although most of our users have indeed used it.

zhisbug avatar May 15 '23 10:05 zhisbug

Closing this PR; I am going to start a new PR with the fix.

zhisbug avatar May 15 '23 10:05 zhisbug