Reshape ZeroStage=0 FP16 Checkpoint
What is the best way for reshaping a checkpoint trained with zero stage = 0 & fp16?
I see two options:
a) Continue training with zero stage 1 for 1 step & adapt this PR to work with fp16
b) Adapt the script here to work without needing ZeRO ckpts; the difficult part will just be reshaping the optimizer states in the mp_rank files
Maybe @tjruwase could give me a quick hint if a) or b) makes more sense before I waste my time? Thanks!
@Muennighoff, thanks for your question. Can you please clarify a bit more because zero_stage=0 actually disables ZeRO and is pure DDP. The only reshaping needs that I can imagine in such cases will be due to tensor parallelism or pipeline parallelism.
Yes, there's no ZeRO used, only TP & PP. The TP is based on the Megatron-DS implementation. Specifically, I am looking at a TP=4, PP=4 model. Based on my understanding, I need to change the layer files due to TP and the mp_rank files due to TP & PP.
How would you go about it?
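For context, a rough sketch of how the checkpoint layout could be inspected; the exact file name patterns (layer_*-model_*-model_states.pt, mp_rank_*_model_states.pt) and the checkpoint path are just assumptions based on the usual Megatron-DS convention:

```python
# Rough sketch for inspecting a TP=4 / PP=4 Megatron-DeepSpeed checkpoint.
# File name patterns and path are assumptions and may differ in your setup.
import glob
import os
import torch

ckpt_dir = "checkpoints/global_step1000"  # hypothetical path

# Per-layer weight shards, partitioned by tensor parallelism
for path in sorted(glob.glob(os.path.join(ckpt_dir, "layer_*-model_*-model_states.pt"))):
    state = torch.load(path, map_location="cpu")
    print(path, {k: tuple(v.shape) for k, v in state.items() if torch.is_tensor(v)})

# Per-rank files, which (among other things) hold the fp16 optimizer state
for path in sorted(glob.glob(os.path.join(ckpt_dir, "mp_rank_*_model_states.pt"))):
    state = torch.load(path, map_location="cpu")
    print(path, list(state.keys()))
```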
Great. Thanks for the clarification. Also, do you need reshaping of just the model weights or also of the optimizer state? The reshaping logic you reference is split across bigscience/megatron-deepspeed and deepspeed, is very new, and has only been tested with bf16 + pipeline parallelism + ZeRO stage 1.
In terms of your proposed options, I feel (b) is more straightforward and thus easier. Option (a) will require (1) creating ZeRO ckpts only for the sake of reshaping and (2) porting the reshaping changes in the bf16_optimizer into the fp16 zero_stage_1 optimizer. Although option (b) requires changes to the reshaping script, I think those changes will be useful anyway for non-ZeRO training scenarios such as yours. Does that make sense?
Perhaps @stas00, who is the co-author of the reshaping feature, might have some thoughts as well.
Yes, I need to continue training in the new shape, so I think I will also need to reshape the optimizer states. I will continue training with ZeRO stage 1, however.
Thanks for your thoughts! I will work on (b) then. I think I only need to figure out how to merge the optimizer states in the mp_rank files correctly.
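As a starting point, something like this rough sketch for merging the Adam moments across TP ranks; the key names ("optimizer", "exp_avg", "exp_avg_sq") and the per-parameter partition dimensions are assumptions I'd still need to verify against the actual mp_rank files:

```python
# Rough sketch: merge fp16 Adam optimizer states from TP shards of one PP stage.
# Key names and partition dims are assumptions -- verify against the real files.
import torch

def merge_tp_optimizer_states(shard_paths, partition_dims):
    """shard_paths: mp_rank_* files of one PP stage, ordered by TP rank.
    partition_dims: {param_index: dim} for TP-partitioned params (None = replicated)."""
    shards = [torch.load(p, map_location="cpu")["optimizer"]["state"] for p in shard_paths]
    merged = {}
    for idx, state0 in shards[0].items():
        dim = partition_dims.get(idx)
        merged[idx] = {"step": state0["step"]}
        for key in ("exp_avg", "exp_avg_sq"):
            tensors = [s[idx][key] for s in shards]
            if dim is None:
                merged[idx][key] = tensors[0]                   # replicated across TP ranks
            else:
                merged[idx][key] = torch.cat(tensors, dim=dim)  # TP-partitioned
    return merged
```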
Yes, once the bf16/z0 PR is merged, we can look at fp16/z0 next.
The other approach is to:
- start with random optimizer states
- run for some steps with LR=0 to let the optimizer catch up
- resume training with normal LR
The details and math of how many steps to run are in the 104B chronicles; I can dig up the link if you want to explore this option.
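As a generic illustration of the idea (not the actual 104B recipe), in plain PyTorch it would look roughly like this, with the step count and target LR as placeholders:

```python
# Generic illustration of the "LR=0 catch-up" idea in plain PyTorch;
# CATCHUP_STEPS and TARGET_LR are placeholders, not values from the 104B run.
import torch

CATCHUP_STEPS = 100      # hypothetical; see the 104B chronicles for the actual math
TARGET_LR = 6e-5         # hypothetical target learning rate

model = torch.nn.Linear(8, 8)   # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=0.0)  # fresh optimizer state

for step in range(CATCHUP_STEPS):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()        # LR=0: weights unchanged, Adam moments accumulate
    optimizer.zero_grad()

# restore the normal learning rate and resume training as usual
for group in optimizer.param_groups:
    group["lr"] = TARGET_LR
```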