DeepSpeed
Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
Hi, I trained an LLM on 4 nodes (8 GPUs per node), but when I load the checkpoint on 16 nodes, I get the following error:
deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
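The two numbers in the message match the setup described above: 4 nodes x 8 GPUs = 32 ranks when the checkpoint was saved, and 16 nodes x 8 GPUs = 128 ranks now. The snippet below is only a rough sketch of that arithmetic and of the kind of mismatch check the error implies. It is not DeepSpeed's actual code, the function names `dp_world_size` and `check_zero_world_size` are hypothetical, and it assumes pure data parallelism (no tensor or pipeline parallelism), so every GPU is one data-parallel rank.

```python
def dp_world_size(nodes: int, gpus_per_node: int) -> int:
    """Data-parallel world size, assuming every GPU is one DP rank.

    Hypothetical helper: this only holds when no tensor or pipeline
    parallelism is used, which is an assumption about the setup.
    """
    return nodes * gpus_per_node


def check_zero_world_size(saved: int, current: int) -> None:
    """Illustrative stand-in for the check that produces the error.

    Not DeepSpeed's implementation; it only mimics the condition
    described in the error message.
    """
    if saved != current:
        raise RuntimeError(
            f"The checkpoint being loaded used a DP world size of {saved} "
            f"but the current world size is {current}."
        )


saved = dp_world_size(nodes=4, gpus_per_node=8)     # job that wrote the checkpoint
current = dp_world_size(nodes=16, gpus_per_node=8)  # job that loads it

try:
    check_zero_world_size(saved, current)
except RuntimeError as err:
    print(err)
```

Under these assumptions the check only passes when the loading job has the same number of data-parallel ranks as the job that wrote the checkpoint.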