DeepSpeed
Fix checkpoint loading when zero optimizer states are not given
When I use DeepSpeed for finetuning without providing the ZeRO state checkpoints, the FP32 master parameters are not initialized properly. This PR fixes that issue.
@szhengac Thanks for this PR. Can you please explain the usage scenario a bit more? Is this loading a ZeRO checkpoint without providing the actual checkpoint files?
@tjruwase DeepSpeed initializes the FP32 master weights in the engine when deepspeed.initialize is called. After that, we use DeepSpeed's load_checkpoint to load the model weights without providing the ZeRO state checkpoints, which contain the FP32 master weights and the Adam momentum states. In this case the FP16 weights are correct, but the FP32 master weights still hold their random initial values. This breaks finetuning and has tripped us up for over a month.
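A minimal sketch of the flow I mean (the model, config values, and checkpoint path here are hypothetical stand-ins, and the keyword names follow the public DeepSpeed API as I understand it, not code from this PR):

```python
import torch
import deepspeed

# Hypothetical toy model and ZeRO config, used only to illustrate the flow.
model = torch.nn.Linear(8, 8)
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize creates the FP32 master copy of the FP16 weights
# inside the ZeRO optimizer.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Loading a checkpoint directory that has no ZeRO optimizer state files
# updates the FP16 weights, but the FP32 master copy created above keeps
# its initial random values: the mismatch described in this PR.
load_path, client_state = model_engine.load_checkpoint(
    "checkpoints/pretrained",            # hypothetical path
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```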
@szhengac I am really sorry to hear that this has been a blocking issue for over a month. My concern is that this PR creates a strange code path that could be hard to maintain and easy to break. This is why I asked about the scenario, so that we can provide better support for it.
It seems that you are trying to use ZeRO to finetune a model that was pre-trained with ZeRO. However, you don't want to use the ZeRO checkpoint state, which includes the fp32 params and the Adam optimizer state. Is this correct?
@tjruwase Yes. There are three reasons: 1) we may add additional layers for finetuning, in which case the partition shapes do not align and we fail to load the ZeRO checkpoints; 2) it does not make sense to initialize finetuning with the optimizer states from pretraining; 3) we may want to use a different optimizer.
Got it. Thanks for sharing this scenario. In that case, what if you called optimizer.refresh_fp32_params() from your client script after load_checkpoint() returns? Can you check whether this achieves your goal?
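Roughly, the client-side workaround would look like this (a sketch only; the checkpoint path is hypothetical and the keyword flags are my reading of the standard engine API):

```python
# Load only the module weights, then rebuild the FP32 master weights from
# the freshly loaded FP16 weights with the suggested call.
load_path, client_state = model_engine.load_checkpoint(
    "checkpoints/pretrained",            # hypothetical path
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
optimizer.refresh_fp32_params()          # optimizer returned by deepspeed.initialize
```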
@tjruwase This is what I originally did to make it work. But I think it is frustrating that load_checkpoint cannot handle such a case.
@tjruwase There is one more possible bug we found yesterday: ZeRO-2 gives much higher accuracy than ZeRO-1 in finetuning. All hyperparameters are the same; the only change is the ZeRO stage in the DeepSpeed JSON config, as sketched below. We have tested this over several runs and are still not sure why it happens. If you have any idea, it would be helpful.
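The relevant part of the config, with everything else identical between the two runs (the other values here are hypothetical placeholders):

```python
# All hyperparameters are identical across the two runs; only the ZeRO
# stage in the DeepSpeed config changes.
ds_config = {
    "train_batch_size": 32,              # hypothetical value
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # set to 1 to get the lower-accuracy run
}
```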
@szhengac
I am happy to look into the ZeRO-1 regression. Can you open an issue and provide repro steps?
Regarding the original finetuning issue, here is my description of the scenario: "Supporting ZeRO-finetuning of a ZeRO-pretrained model, inheriting only the model weights (fp16 and fp32)".
Does this fully capture your use case? I would like to present this to the team so that we can start building this support into DeepSpeed. Thanks.
@tjruwase I will see if I can reproduce the ZeRO-1 regression using a publicly available dataset.
Yes, that description captures my case. Also, I would like to draw your attention to my comment in this issue: https://github.com/microsoft/DeepSpeed/issues/684#issuecomment-764382238. Please let me know if my understanding is correct.
Thanks
Can one of the admins verify this patch?
Thanks for your contribution @szhengac. I am closing this fairly old/stale PR due to conflicts; the needed functionality may have already been added since. If you still find it relevant, please reopen the PR.