
Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

Open hahchenchen opened this issue 1 year ago • 3 comments

Hi, I trained an LLM on 4 nodes (8 GPUs per node), but when I load the checkpoint on 16 nodes, I get the following error:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
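
For reference, the two world sizes in the message follow directly from the node counts. Here is a minimal sketch of that arithmetic, assuming one rank per GPU and pure data parallelism (no tensor or pipeline parallelism). The helper names `dp_world_size` and `partitioning_matches` are hypothetical and are not part of DeepSpeed:

```python
def dp_world_size(nodes: int, gpus_per_node: int) -> int:
    """Data-parallel world size, assuming one rank per GPU
    and no tensor/pipeline parallelism (an assumption)."""
    return nodes * gpus_per_node


def partitioning_matches(saved: int, current: int) -> bool:
    """ZeRO splits optimizer state across DP ranks, so the saved
    partitions only line up when both world sizes are equal."""
    return saved == current


saved = dp_world_size(4, 8)     # run that wrote the checkpoint
current = dp_world_size(16, 8)  # run that tries to load it
print(saved, current, partitioning_matches(saved, current))
# prints: 32 128 False
```

This only mirrors the condition named in the error message; it does not load or convert a checkpoint.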

hahchenchen, Jun 26 '23 09:06