DeepSpeed
Why not save frozen params unless: `self.zero_optimization_stage() >= ZeroStageEnum.gradients`?
I've spent 2 days drilling into why my frozen params aren't getting saved, and it comes down to this line:
https://github.com/microsoft/DeepSpeed/blob/c632ea09f8d107d10f76aa2b776e4df3c1ccf98a/deepspeed/runtime/engine.py#L3297C1-L3297C107
save_frozen_param = self.zero_optimization_partition_gradients() and not exclude_frozen_parameters
The name exclude_frozen_parameters is therefore misleading, since it is not the only thing that determines whether frozen params get saved.
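To make the dependency explicit, here is a minimal sketch paraphrasing the linked line (not the actual DeepSpeed source), with the partition-gradients check reduced to a plain boolean:

def should_save_frozen_params(partition_gradients: bool, exclude_frozen_parameters: bool) -> bool:
    """Paraphrase of the linked line in engine.py: frozen params are only written
    when gradient partitioning (ZeRO stage >= 2) is active AND the caller did not
    explicitly exclude them."""
    return partition_gradients and not exclude_frozen_parameters

# Even with exclude_frozen_parameters=False, a stage 0/1 run never saves frozen params:
print(should_save_frozen_params(partition_gradients=False, exclude_frozen_parameters=False))  # False
print(should_save_frozen_params(partition_gradients=True, exclude_frozen_parameters=False))   # True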
To make matters more confusing, I am using deepspeed 2, but if I set a breakpoint inside that zero_optimization_partition_gradients function, I see:
(Pdb) self.zero_optimization_stage()
1
(Pdb) ZeroStageEnum.gradients
<ZeroStageEnum.gradients: 2>
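If ZeroStageEnum behaves like an IntEnum (which the <ZeroStageEnum.gradients: 2> repr suggests, though I haven't confirmed against the source), the check reduces to an integer comparison, so under stage 1 frozen params are never written:

from enum import IntEnum

class ZeroStageEnum(IntEnum):
    # Minimal stand-in for DeepSpeed's enum; only the member relevant here.
    gradients = 2  # ZeRO stage 2: partition gradients

current_stage = 1  # what self.zero_optimization_stage() returned at the breakpoint
# zero_optimization_partition_gradients() is effectively this comparison,
# so it is False under stage 1 and save_frozen_param ends up False.
print(current_stage >= ZeroStageEnum.gradients)  # False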
Why is this, and is there a straightforward non-hacky solution to get frozen params to save?
@freckletonj, thanks for reporting this issue. I agree it is quite confusing, sorry about that. Unfortunately, I can't remember the rationale for including self.zero_optimization_partition_gradients() in the conditional logic.
Can you please clarify what you mean by "deepspeed 2"? Do you mean you are using zero stage 2? Can you please share your ds_config? Your breakpoint printout suggests that you are running zero stage 1.
@tjruwase thanks for the fast response!
Yes, I'm using ZeRO stage 2 via PyTorch Lightning, with a config.yaml:
trainer:
  accelerator: gpu
  devices: auto
  num_nodes: 1
  strategy: deepspeed_stage_2
  ...
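For what it's worth, I believe (though I haven't verified the exact defaults) that the deepspeed_stage_2 strategy string amounts to something like passing an explicit stage-2 config to Lightning's DeepSpeedStrategy, which is also roughly the ds_config you asked about:

from lightning.pytorch.strategies import DeepSpeedStrategy

# Roughly what I understand `strategy: deepspeed_stage_2` to expand to; the
# `zero_optimization.stage: 2` entry is the part relevant to this issue.
ds_config = {
    "zero_optimization": {
        "stage": 2,
    },
}
strategy = DeepSpeedStrategy(config=ds_config)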
I was surprised to see the breakpoint print that I'm in stage 1, but I think that's a separate issue from the confusing conditional logic.
And there's a chance I'm just going about this all wrong; I'm new to both Lightning and DeepSpeed, so forgive me if I'm overlooking something important :)
To clarify, my only concern is how to save frozen params along with the model.
Some more background: I'm working on a fork of the RWKV project, where they save the weights with a copy of zero_to_fp32.py.
I've added a hack to this file to keep those params: they do live under the state dict's 'module' key, but not under FROZEN_PARAM_SHAPES, where they would get picked up automatically: https://github.com/RWKV/RWKV-infctx-trainer/commit/51f9173117b1cdaf0ba0602348a7b1cf56bff042#diff-d1b1e811618e950083898fd2b934639a17307a0d339ee61aa96f3d7539463e26R142
I've also tried Lightning's version of this function, but it also drops the frozen params: https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.utilities.deepspeed.html#lightning.pytorch.utilities.deepspeed.convert_zero_checkpoint_to_fp32_state_dict
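For anyone hitting the same thing, here is a rough sketch of the kind of post-hoc merge I mean (not the exact code in the linked commit; the model-states file naming and the 'module' key location are assumptions based on what my ZeRO 1/2 checkpoints look like):

import glob
import os

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint


def fp32_state_dict_with_frozen(checkpoint_dir, tag=None):
    # Trainable params: reconstructed from the ZeRO partitions by DeepSpeed's helper.
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=tag)

    # Frozen params: in my checkpoints they are not recorded under FROZEN_PARAM_SHAPES,
    # but they do appear under the 'module' key of the model-states file(s).
    # (The file-name pattern below is an assumption, not guaranteed across versions.)
    pattern = os.path.join(checkpoint_dir, tag or "*", "*model_states.pt")
    for path in sorted(glob.glob(pattern)):
        module_state = torch.load(path, map_location="cpu").get("module", {})
        for name, tensor in module_state.items():
            if name not in state_dict:  # only fill in the params ZeRO dropped
                state_dict[name] = tensor
    return state_dict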
@freckletonj, apologies for the delayed response here. Is this the RWKV project? https://github.com/BlinkDL/RWKV-LM.
Can you please share your current status? Can you provide repro steps for us? Thanks!