DeepSpeed
Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
Hi, I have trained a GPT model on 4 nodes (8 GPUs per node), but when I load the checkpoint with 6 nodes, I get the following error:
deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 4 but the current world size is 6. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
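For context, this failure comes out of the standard DeepSpeed save/load cycle. A minimal sketch of that pattern (here `model` and `ds_config` are placeholders, not from this issue):

```python
import deepspeed

# Wrap the model in a DeepSpeed engine (ZeRO stage set in ds_config).
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)

# Saved while running with one data-parallel world size...
engine.save_checkpoint("checkpoints", tag="global_step1000")

# ...then reloaded under a different world size: each rank looks for its
# own ZeRO optimizer-state shard, the shard counts no longer match, and
# DeepSpeed raises the ZeRORuntimeException quoted above.
engine.load_checkpoint("checkpoints", tag="global_step1000")
```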
Hi, can you share your experience? I also encountered the same problem.
Hi, can you share your solution to this? Thanks.
Hi, can you share your solution to this problem? Thanks. @cdj0311
Same problem.
Same Issue
Same here.
same here +1
same here + 1
Same here. Is there any guidance on how to proceed?
Can you share your solution with me? Thanks. @cdj0311
Any update on this? @cdj0311
The problem is that the checkpoint was saved with one world size (e.g. 6 GPUs) but is being reloaded with another (e.g. 4 GPUs).
Try `torch.load(..., map_location=torch.device('cpu'))`
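If you only need the weights (not the ZeRO optimizer state), one way to apply that suggestion is to read the model-states file directly and skip `engine.load_checkpoint` entirely. A sketch, assuming the usual DeepSpeed layout where the weights sit in a `mp_rank_00_model_states.pt` file under a `"module"` key (check your own checkpoint folder for the exact names):

```python
import torch

# Map everything to CPU so loading does not depend on the GPU layout
# that was in place at save time.
state = torch.load("checkpoints/global_step1000/mp_rank_00_model_states.pt",
                   map_location=torch.device("cpu"))

# Restore just the weights; optimizer state is NOT recovered this way.
model.load_state_dict(state["module"])
```

Note this sidesteps the error rather than fixing it: the ZeRO optimizer-state shards are still partitioned for the old world size, so training resumes with a fresh optimizer.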
same issue + 1
Same here.
deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 7 but the current world size is 8. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
We recommend that you use DeepSpeed Universal Checkpointing, which was added for exactly this case: it converts a ZeRO checkpoint so training can resume with a different world size.
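Roughly, the workflow has two steps (the script and config key below come from DeepSpeed's universal-checkpointing support; paths and the other config values are placeholders). First convert the saved ZeRO checkpoint with the `ds_to_universal.py` script shipped under `deepspeed/checkpoint/`, pointing `--input_folder` at the saved step and `--output_folder` at a new directory. Then resume on the new world size with universal loading enabled in the DeepSpeed config:

```python
import deepspeed

ds_config = {
    "train_batch_size": 32,            # placeholder values
    "zero_optimization": {"stage": 1},
    # Tell DeepSpeed that the checkpoint being loaded is a universal one.
    "checkpoint": {"load_universal": True},
}

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)

# Load the converted checkpoint; DeepSpeed repartitions the optimizer
# state for the current world size.
engine.load_checkpoint("checkpoints", tag="global_step1000_universal")
```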