DeepSpeed
Why not save frozen params unless: `self.zero_optimization_stage() >= ZeroStageEnum.gradients`?
I've spent 2 days drilling into why my frozen params aren't getting saved, and it comes down to this line:
https://github.com/microsoft/DeepSpeed/blob/c632ea09f8d107d10f76aa2b776e4df3c1ccf98a/deepspeed/runtime/engine.py#L3297C1-L3297C107
save_frozen_param = self.zero_optimization_partition_gradients() and not exclude_frozen_parameters
The name exclude_frozen_parameters is therefore misleading, since it is not the only thing that determines whether frozen params get saved.
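To make the dependency explicit, here is a minimal sketch paraphrasing the linked line (not the actual DeepSpeed source), with the partition-gradients check reduced to a plain boolean:

def should_save_frozen_params(partition_gradients: bool, exclude_frozen_parameters: bool) -> bool:
    """Paraphrase of the linked line in engine.py: frozen params are only written
    when gradient partitioning (ZeRO stage >= 2) is active AND the caller did not
    explicitly exclude them."""
    return partition_gradients and not exclude_frozen_parameters

# Even with exclude_frozen_parameters=False, a stage 0/1 run never saves frozen params:
print(should_save_frozen_params(partition_gradients=False, exclude_frozen_parameters=False))  # False
print(should_save_frozen_params(partition_gradients=True, exclude_frozen_parameters=False))   # True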
To make matters more confusing, I am using deepspeed 2, but if I set a breakpoint inside that zero_optimization_partition_gradients function, I see:
(Pdb) self.zero_optimization_stage()
1
(Pdb) ZeroStageEnum.gradients
<ZeroStageEnum.gradients: 2>
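If ZeroStageEnum behaves like an IntEnum (which the <ZeroStageEnum.gradients: 2> repr suggests, though I haven't confirmed against the source), the check reduces to an integer comparison, so under stage 1 frozen params are never written:

from enum import IntEnum

class ZeroStageEnum(IntEnum):
    # Minimal stand-in for DeepSpeed's enum; only the member relevant here.
    gradients = 2  # ZeRO stage 2: partition gradients

current_stage = 1  # what self.zero_optimization_stage() returned at the breakpoint
# zero_optimization_partition_gradients() is effectively this comparison,
# so it is False under stage 1 and save_frozen_param ends up False.
print(current_stage >= ZeroStageEnum.gradients)  # False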
Why is this, and is there a straightforward non-hacky solution to get frozen params to save?
@freckletonj, thanks for reporting this issue. I agree it is quite confusing, sorry about that. Unfortunately, I can't remember the rationale for including self.zero_optimization_partition_gradients() in the conditional logic.
Can you please clarify what you mean by "deepspeed 2"? Do you mean you are using zero stage 2? Can you please share your ds_config? Your breakpoint printout suggests that you are running zero stage 1.
@tjruwase thanks for the fast response!
Yes, I'm using ZeRO stage 2 via PyTorch Lightning, with a config.yaml:
trainer:
  accelerator: gpu
  devices: auto
  num_nodes: 1
  strategy: deepspeed_stage_2
  ...
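For what it's worth, I believe (though I haven't verified the exact defaults) that the deepspeed_stage_2 strategy string amounts to something like passing an explicit stage-2 config to Lightning's DeepSpeedStrategy, which is also roughly the ds_config you asked about:

from lightning.pytorch.strategies import DeepSpeedStrategy

# Roughly what I understand `strategy: deepspeed_stage_2` to expand to; the
# `zero_optimization.stage: 2` entry is the part relevant to this issue.
ds_config = {
    "zero_optimization": {
        "stage": 2,
    },
}
strategy = DeepSpeedStrategy(config=ds_config)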
I was surprised to see the breakpoint print that I'm in stage 1, but I think that's a separate issue from the confusing conditional logic.
And there's a chance I'm just going about this all wrong; I'm new to both Lightning and DeepSpeed, so forgive me if I'm overlooking something important :)
To clarify, my only concern is how to save frozen params along with the model.
Some more background: I'm working on a fork of the RWKV project, where they save the weights with a copy of zero_to_fp32.py.
I've added a hack to this file to keep those params: they do live under the state dict's 'module' key, but not under FROZEN_PARAM_SHAPES, where they would get picked up automatically: https://github.com/RWKV/RWKV-infctx-trainer/commit/51f9173117b1cdaf0ba0602348a7b1cf56bff042#diff-d1b1e811618e950083898fd2b934639a17307a0d339ee61aa96f3d7539463e26R142
I've also tried Lightning's version of this function, but it also drops the frozen params: https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.utilities.deepspeed.html#lightning.pytorch.utilities.deepspeed.convert_zero_checkpoint_to_fp32_state_dict
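For anyone hitting the same thing, here is a rough sketch of the kind of post-hoc merge I mean (not the exact code in the linked commit; the model-states file naming and the 'module' key location are assumptions based on what my ZeRO 1/2 checkpoints look like):

import glob
import os

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint


def fp32_state_dict_with_frozen(checkpoint_dir, tag=None):
    # Trainable params: reconstructed from the ZeRO partitions by DeepSpeed's helper.
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=tag)

    # Frozen params: in my checkpoints they are not recorded under FROZEN_PARAM_SHAPES,
    # but they do appear under the 'module' key of the model-states file(s).
    # (The file-name pattern below is an assumption, not guaranteed across versions.)
    pattern = os.path.join(checkpoint_dir, tag or "*", "*model_states.pt")
    for path in sorted(glob.glob(pattern)):
        module_state = torch.load(path, map_location="cpu").get("module", {})
        for name, tensor in module_state.items():
            if name not in state_dict:  # only fill in the params ZeRO dropped
                state_dict[name] = tensor
    return state_dict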
@freckletonj, apologies for the delayed response here. Is this the RWKV project? https://github.com/BlinkDL/RWKV-LM.
Can you please share your current status? Can you provide repro steps for us? Thanks!