torchacc

Support save and load of fsdp_optim_state

Open hanwen-sun opened this issue 5 months ago • 1 comment

What this PR does:

  1. Support save and load of full_optim_state_dict in both flattened (including the padding applied before sharding) and unflattened form, covered by unit tests.
  2. Support save and load of shard_optim_state_dict.
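
To illustrate what "padding before shard" means in item 1, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names `flatten_and_shard` and `gather_and_unflatten` are hypothetical, and lists of floats stand in for optimizer state tensors such as `exp_avg`. The idea is that the per-parameter state is concatenated into one flat buffer, padded so it divides evenly across ranks, and split; loading reverses those steps.

```python
from typing import List, Sequence, Tuple


def flatten_and_shard(
    states: Sequence[List[float]], world_size: int
) -> Tuple[List[List[float]], int]:
    """Hypothetical helper: flatten per-parameter state, pad, then shard.

    Returns one equally sized shard per rank and the number of padding
    elements appended to the flat buffer.
    """
    # Concatenate every parameter's state into one flat buffer.
    flat = [x for state in states for x in state]
    # Pad with zeros so the buffer length is a multiple of world_size.
    pad = (-len(flat)) % world_size
    flat = flat + [0.0] * pad
    shard_len = len(flat) // world_size
    shards = [flat[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]
    return shards, pad


def gather_and_unflatten(
    shards: Sequence[List[float]], pad: int, sizes: Sequence[int]
) -> List[List[float]]:
    """Hypothetical helper: rebuild the per-parameter state from the shards."""
    flat = [x for shard in shards for x in shard]
    # Drop the padding (guard against pad == 0, since flat[:-0] is empty).
    if pad:
        flat = flat[:-pad]
    out, offset = [], 0
    for n in sizes:
        out.append(flat[offset:offset + n])
        offset += n
    return out


# Two parameters with 3 and 4 elements, sharded across 4 ranks.
state = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
shards, pad = flatten_and_shard(state, world_size=4)
restored = gather_and_unflatten(shards, pad, sizes=[3, 4])
```

In this sketch the 7 elements are padded to 8, so each rank holds 2 elements and the last rank's shard ends with one padding zero. A flattened checkpoint has to record the padding (or the original sizes) so that loading can strip it again.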

TODO:

  1. Test the memory usage of checkpointing a 70B model.
  2. shard_param_on_dim_0(?)

hanwen-sun, Sep 03 '24 11:09