FastChat
[WIP] Fix FSDP saving error
#476
pending test by @ZYHowell
@ZYHowell @zhisbug Any updates or close this?
@merrymercy I can help with the test, since I had the same problem before. Update results later.
update
Tried this PR with 4*A100 (80G): training is OK, but it OOMs when saving.
I might dig into this later.
@merrymercy @zhisbug Tried several different settings with the FSDP API; all failed when saving the model.
But based on this comment, I finally managed to save the model with python3.10 and torch==2.0 by changing /python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py
on line 309 from state_dict[fqn] = state_dict[fqn].clone().detach()
to state_dict[fqn] = state_dict[fqn].cpu().clone().detach()
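For reference, the same effect (gathering parameters on CPU instead of GPU at save time) can be had without patching torch internals, via FSDP's state-dict configuration API in torch 2.0. A minimal sketch; the function and argument names here are illustrative, and this is not necessarily what the follow-up PR will do:

```python
def save_full_state_dict(model, path, rank):
    """Save an FSDP-wrapped model by gathering the full state dict on CPU.

    offload_to_cpu=True moves each gathered shard to host memory instead of
    materializing the full model on one GPU (the cause of the save-time OOM),
    and rank0_only=True builds the dict only on rank 0.
    Assumes torch>=2.0 and an initialized process group.
    """
    import torch
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        StateDictType,
        FullStateDictConfig,
    )

    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = model.state_dict()  # gathered on CPU, rank 0 only
    if rank == 0:
        torch.save(state_dict, path)
```

With this pattern, every rank must enter the context and call state_dict() (it is a collective), but only rank 0 writes the file.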
Test machine: 4*A100(80G).
@alanxmay This is just a workaround; indeed, most of our users have relied on it.
Closing this PR; I am going to open a new PR with the fix.