DeepSpeed Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

Open cdj0311 opened this issue 2 years ago • 6 comments

Hi, I trained a GPT model on 4 nodes (8 GPUs per node), but when I load the checkpoint on 6 nodes, I get the following error:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 4 but the current world size is 6. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
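
For readers wondering why the world size matters here: the toy sketch below is not DeepSpeed code, just an illustration under the assumption that a flat optimizer-state buffer is cut into one contiguous, equally sized shard per data-parallel rank. The helper name `partition_bounds` and the buffer size of 24 are made up for the example.

```python
def partition_bounds(numel, world_size):
    """Return one (start, end) index range per rank for a flat buffer.

    Toy assumption: shards are contiguous and equally sized (ceiling division),
    which is a simplification and not DeepSpeed's actual on-disk layout.
    """
    shard = -(-numel // world_size)  # ceiling division
    return [(rank * shard, min((rank + 1) * shard, numel))
            for rank in range(world_size)]


# Shards as they would be saved with a DP world size of 4 ...
saved = partition_bounds(24, 4)
# ... versus the shards each rank expects with a DP world size of 6.
current = partition_bounds(24, 6)

print("saved  :", saved)
print("current:", current)
```

With 4 ranks each shard covers 6 elements, with 6 ranks each covers 4, so no saved shard file corresponds to what a rank in the new run expects. The state has to be merged and re-split, which is the step the error message says is not done automatically.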

Feb 25 '23 10:02 cdj0311

hi, can you share your experience? I also encountered the same problem

Mar 01 '23 06:03 chenzhanyiczy

Hi, Can you share your solution to this? thanks

Mar 01 '23 18:03 sasaadi

Hi, can you share your solution to this problem? thanks. @cdj0311

Mar 10 '23 04:03 zhao0306

Same problem.

Mar 21 '23 13:03 JinmingZhao

Same Issue

Apr 02 '23 16:04 liutaocode

Same here.

Apr 12 '23 19:04 tommy-qichang

same here +1

Apr 19 '23 06:04 cmstudyscode

same here + 1

May 10 '23 03:05 Rafa-zy

same here + 1

May 18 '23 08:05 xjwang4

Same here. Is there any guidance on how to proceed?

May 19 '23 17:05 gugarosa

Can you share your solution with me? Thanks. @cdj0311

Jun 04 '23 09:06 young-chao

Any update on this? @cdj0311

Jun 16 '23 13:06 macabdul9

The problem is caused by the checkpoint being saved with one world size (a DP world size of 4 in the original report) and reloaded with a different one (6).

Try torch.load(..., map_location=torch.device('cpu'))

Dec 22 '23 08:12 kxgong

Hello, your email has been received, thank you!

Dec 22 '23 08:12 young-chao

same issue + 1

Feb 07 '24 02:02 ybdesire

Same here:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 7 but the current world size is 8. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

Apr 05 '24 14:04 timpal0l

We recommend that you use DeepSpeed universal checkpoint.
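
A minimal sketch of what that can look like, as far as I understand the universal checkpointing tutorial: first convert the existing ZeRO checkpoint with the `ds_to_universal.py` script that ships with DeepSpeed (it takes `--input_folder` and `--output_folder`), then resume on the new world size with `load_universal` enabled in the `checkpoint` section of the DeepSpeed config. The batch size and ZeRO stage below are placeholders, not values from this thread, and the exact options may differ between DeepSpeed versions, so please check the docs for yours.

```python
import json

# Placeholder training settings; replace them with your own config.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder value
    "zero_optimization": {"stage": 1},     # placeholder value
    # Ask DeepSpeed to load a checkpoint that was already converted
    # to the universal format (assumed key: checkpoint.load_universal).
    "checkpoint": {"load_universal": True},
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)  # save this as your DeepSpeed JSON config file
```

The resulting JSON is passed to the training run the same way as any other DeepSpeed config. The conversion step has to be run once on the checkpoint saved with the old world size before resuming.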

Apr 05 '24 19:04 samadejacobs
