Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
Hi, I have trained an LLM with 4 nodes (8 GPUs per node), but when I load the checkpoint with 16 nodes, I get the following error:
deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
@stas00 Can you help to solve this problem? Thanks
You're giving too little information to go on. How did you train it? I assume with ZeRO-3.
And you're now trying to load the model using the DeepSpeed checkpoint? Unfortunately, changing the topology after training has started isn't yet supported by DeepSpeed - please see this feature request: https://github.com/microsoft/DeepSpeed/issues/2921
So meanwhile the only thing you can do is extract the fp32 weights using the zero_to_fp32.py script that you will find in the checkpoint folder, and start a new training (or inference) run from this extracted checkpoint. If you wanted to continue using the optimizer state, you can't - it is discarded.
You can read about the extraction here:
https://huggingface.co/docs/transformers/main/en/main_classes/deepspeed#getting-the-model-weights-out
scroll down to the "Offline FP32 Weights Recovery" section.
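For what it's worth, the same extraction is also available as a Python API. Here is a minimal sketch, assuming `checkpoints/` is your save folder (the one containing the `global_step*` subfolders) and `model` is your `torch.nn.Module`:

```python
# A minimal sketch, assuming "checkpoints/" is the checkpoint save folder.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

model = ...  # placeholder: your torch.nn.Module, built the same way as in training

# Consolidate the sharded ZeRO checkpoint into a single fp32 state_dict.
# This runs on CPU and needs enough RAM to hold the full unsharded model.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints/")
model.load_state_dict(state_dict)
# Note: the optimizer state is not recovered - a new training run starts
# with a fresh optimizer.
```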
You can add a comment in https://github.com/microsoft/DeepSpeed/issues/2921 and request that this be implemented - the more users ask for it, the higher the chances it will get implemented.
How can I solve this?
raise ZeRORuntimeException("The checkpoint being loaded used a DP "
[rank5]: deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 8 but the current world size is 16. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
@ArtificialZeng @hahchenchen You can now resume training with a different DP size (a different number of nodes) via Universal Checkpointing.
You can find more examples in the Megatron-DeepSpeed repo.
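In case it helps the next reader, here is a rough sketch of the two-step flow (folder names are placeholders, and the exact flags may vary by DeepSpeed version, so check the Universal Checkpointing docs for your release):

```python
# Rough sketch, assuming a recent DeepSpeed release with Universal Checkpointing.
# Step 1: convert the existing ZeRO checkpoint into the topology-agnostic
# "universal" format with the ds_to_universal converter.
import subprocess

subprocess.run(
    [
        "python", "-m", "deepspeed.checkpoint.ds_to_universal",
        "--input_folder", "checkpoints/global_step1000",       # placeholder path
        "--output_folder", "checkpoints/global_step1000_universal",
    ],
    check=True,
)
# Step 2: resume training against the converted folder with the new world
# size - in Megatron-DeepSpeed via the --universal-checkpoint flag; with
# plain DeepSpeed, by enabling universal checkpoint loading in the config
# ("checkpoint": {"load_universal": true}).
```

The idea behind the universal format is that optimizer state is stored as per-parameter fragments rather than per-rank shards, so it can be re-partitioned to any new DP degree at load time.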
@tjruwase I believe this issue could be closed :).