DeepSpeed

Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

hahchenchen opened this issue Jun 26 '23 · 3 comments

Hi, I trained an LLM with 4 nodes (8 GPUs per node), but when I load the checkpoint with 16 nodes, I get the following error:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

hahchenchen · Jun 26 '23 09:06

@stas00 Can you help to solve this problem? Thanks

hahchenchen · Jun 28 '23 07:06

You're giving too little information to go on. How did you train it? I assume with ZeRO-3.

And you're now trying to load the model from the DeepSpeed checkpoint? Unfortunately, changing the topology after training has started isn't yet supported by DeepSpeed - please see this feature request: https://github.com/microsoft/DeepSpeed/issues/2921

So meanwhile, the only thing you can do is extract the fp32 weights using the zero_to_fp32.py script that you will find in the checkpoint folder, and start a new training run (or inference) from the extracted checkpoint. This means the optimizer states are not carried over, so if you wanted to continue using the optimizer, you can't.

You can read about the extraction here: https://huggingface.co/docs/transformers/main/en/main_classes/deepspeed#getting-the-model-weights-out - scroll down to the "Offline FP32 Weights Recovery" section.
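For a rough idea of what that looks like, here is a minimal sketch (the checkpoint path and the tiny stand-in model are placeholders, and the exact CLI arguments of zero_to_fp32.py differ a little between DeepSpeed versions):

```python
# Option 1 (shell): run the script DeepSpeed saves alongside every checkpoint to
# consolidate the ZeRO shards into a single fp32 state dict, e.g.:
#   python zero_to_fp32.py /path/to/checkpoint_dir pytorch_model.bin

# Option 2 (Python): do the same in-process and load the result into a model.
import torch
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

# Rebuild the model exactly as it was defined for training (tiny stand-in here).
model = torch.nn.Linear(16, 16)

# Gathers the sharded fp32 weights from the checkpoint and loads them into `model`.
# Optimizer states are NOT recovered, so a resumed run starts with a fresh optimizer.
model = load_state_dict_from_zero_checkpoint(model, "/path/to/checkpoint_dir")
```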

You can add a comment in https://github.com/microsoft/DeepSpeed/issues/2921 requesting that this be implemented - the more users ask for it, the higher the chances it will get implemented.

stas00 · Jun 28 '23 20:06

How do I solve this?

ArtificialZeng · Aug 20 '24 10:08

raise ZeRORuntimeException("The checkpoint being loaded used a DP "
[rank5]: deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 8 but the current world size is 16. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

ArtificialZeng · Aug 26 '24 06:08

@ArtificialZeng @hahchenchen You can now resume training with a different DP size (i.e., a different number of nodes) via Universal Checkpointing.

You can find more examples in the Megatron-DeepSpeed repo.
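As a rough, hedged sketch of the workflow (paths and batch sizes below are placeholders; check the converter invocation and the "load_universal" config key against your DeepSpeed version and the Universal Checkpointing tutorial):

```python
# Step 1 (shell): convert the existing ZeRO checkpoint into the topology-agnostic
# "universal" format with the converter that ships with DeepSpeed, e.g.:
#   python deepspeed/checkpoint/ds_to_universal.py \
#       --input_folder  /path/to/checkpoints/global_step1000 \
#       --output_folder /path/to/checkpoints/global_step1000_universal

# Step 2 (Python): resume on the new world size and tell DeepSpeed to load the
# universal checkpoint. With Megatron-DeepSpeed the equivalent is the
# --universal-checkpoint command-line flag.
import deepspeed

ds_config = {
    "train_batch_size": 128,                  # placeholder values
    "zero_optimization": {"stage": 3},
    "checkpoint": {"load_universal": True},   # load the converted checkpoint
}

# The model and optimizer are built exactly as in the original training script, then:
# engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
# engine.load_checkpoint("/path/to/checkpoints", tag="global_step1000_universal")
```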

xylian86 · Oct 09 '24 16:10

@tjruwase I believe this issue could be closed :).

xylian86 · Oct 09 '24 16:10