DeepSpeed
Fix checkpoint loading when zero optimizer states are not given
When I use DeepSpeed for finetuning without providing the ZeRO state checkpoints, the FP32 master parameters are not initialized properly. This PR fixes that issue.
@szhengac Thanks for this PR. Can you please explain the usage scenario a bit more? Is this loading a ZeRO checkpoint without providing the actual checkpoint files?
@tjruwase DeepSpeed initializes the FP32 master weights in the engine when deepspeed.initialize is called. After that, we use DeepSpeed's load_checkpoint to load the model weights without providing the ZeRO state checkpoints, which contain the FP32 master weights and the Adam momentum states. In this case the FP16 weights are correct, but the FP32 master weights still hold their random initial values. This breaks finetuning and has tripped us up for over a month.
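A minimal sketch of the flow I mean (the model, config values, and checkpoint path here are hypothetical stand-ins, and the keyword names follow the public DeepSpeed API as I understand it, not code from this PR):

```python
import torch
import deepspeed

# Hypothetical toy model and ZeRO config, used only to illustrate the flow.
model = torch.nn.Linear(8, 8)
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize creates the FP32 master copy of the FP16 weights
# inside the ZeRO optimizer.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Loading a checkpoint directory that has no ZeRO optimizer state files
# updates the FP16 weights, but the FP32 master copy created above keeps
# its initial random values: the mismatch described in this PR.
load_path, client_state = model_engine.load_checkpoint(
    "checkpoints/pretrained",            # hypothetical path
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```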
@szhengac I am really sorry to hear that this has been a blocking issue for over a month. My concern is that this PR creates a strange code path that could be hard to maintain and easy to break. This is why I asked about the scenario, so that we can provide better support for it.
It seems that you are trying to use ZeRO to finetune a model that was pre-trained with ZeRO. However, you don't want to use the ZeRO checkpoint state, which includes the fp32 params and the Adam optimizer state. Is this correct?
@tjruwase Yes. There are three reasons: 1) we may add additional layers for finetuning, in which case the partition shapes do not align and we fail to load the ZeRO checkpoints; 2) it does not make sense to initialize finetuning with the optimizer states from pretraining; 3) we may want to use a different optimizer.
Got it. Thanks for sharing this scenario. In that case, what if you called optimizer.refresh_fp32_params() from your client script after load_checkpoint() returns? Can you check whether this achieves your goal?
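Roughly, the client-side workaround would look like this (a sketch only; the checkpoint path is hypothetical and the keyword flags are my reading of the standard engine API):

```python
# Load only the module weights, then rebuild the FP32 master weights from
# the freshly loaded FP16 weights with the suggested call.
load_path, client_state = model_engine.load_checkpoint(
    "checkpoints/pretrained",            # hypothetical path
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
optimizer.refresh_fp32_params()          # optimizer returned by deepspeed.initialize
```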
@tjruwase This is what I originally did to make it work. But I think it is frustrating that load_checkpoint cannot handle such a case.
@tjruwase There is one more possible bug we found yesterday: ZeRO-2 gives much higher accuracy than ZeRO-1 in finetuning. All hyperparameters are the same; the only change is the ZeRO stage in the DeepSpeed JSON config, as sketched below. We have tested this over several runs and are still not sure why it happens. If you have any idea, it would be helpful.
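The relevant part of the config, with everything else identical between the two runs (the other values here are hypothetical placeholders):

```python
# All hyperparameters are identical across the two runs; only the ZeRO
# stage in the DeepSpeed config changes.
ds_config = {
    "train_batch_size": 32,              # hypothetical value
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # set to 1 to get the lower-accuracy run
}
```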
@szhengac
I am happy to look into the ZeRO-1 regression. Can you open an issue and provide repro steps?
Regarding the original finetuning issue, here is my description of the scenario: "Supporting ZeRO-finetuning of a ZeRO-pretrained model, inheriting only the model weights (fp16 and fp32)".
Does this fully capture your use case? I would like to present this to the team so that we can start building this support into DeepSpeed. Thanks.
@tjruwase I will see if I can reproduce the ZeRO-1 regression using a publicly available dataset.
Yes, that description captures my case. Also, I would like to draw your attention to my comment in this issue: https://github.com/microsoft/DeepSpeed/issues/684#issuecomment-764382238. Please let me know if my understanding is correct.
Thanks
Can one of the admins verify this patch?
Thanks for your contribution @szhengac. I am closing this fairly old/stale PR due to conflicts; the needed functionality may have already been added since. If you still find it relevant, please reopen the PR.