Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

Open cdj0311 opened this issue 2 years ago • 6 comments

Hi, I trained a GPT model on 4 nodes (8 GPUs per node), but when I load the checkpoint on 6 nodes, I get the following error:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 4 but the current world size is 6. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
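
For context, here is a minimal sketch of why a change in data-parallel world size matters. It assumes a simplified model in which the optimizer state is one flat buffer split into contiguous, equally sized shards, one per rank. The function `partition_bounds` and the buffer size are hypothetical illustrations, not DeepSpeed's actual implementation.

```python
from typing import List, Tuple


def partition_bounds(num_elements: int, world_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for each rank's contiguous shard.

    Hypothetical helper: a simplified stand-in for how a flat optimizer-state
    buffer might be divided across data-parallel ranks.
    """
    # Ceiling division, so the last shard may be shorter than the others.
    shard = -(-num_elements // world_size)
    return [
        (min(rank * shard, num_elements), min((rank + 1) * shard, num_elements))
        for rank in range(world_size)
    ]


# Shards written at save time (DP world size 4) vs. shards expected at load
# time (DP world size 6) for the same 24-element buffer.
saved = partition_bounds(24, 4)
current = partition_bounds(24, 6)
print("saved:  ", saved)
print("current:", current)
```

Under this simplified model the shard boundaries differ between the two world sizes, so the per-rank files saved by 4 ranks cannot be handed one-to-one to 6 ranks without first being merged and re-split.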

cdj0311 avatar Feb 25 '23 10:02 cdj0311

Hi, can you share your experience? I also encountered the same problem.

chenzhanyiczy avatar Mar 01 '23 06:03 chenzhanyiczy

Hi, can you share your solution to this? Thanks.

sasaadi avatar Mar 01 '23 18:03 sasaadi

Hi, can you share your solution to this problem? thanks. @cdj0311

zhao0306 avatar Mar 10 '23 04:03 zhao0306

Same problem.

JinmingZhao avatar Mar 21 '23 13:03 JinmingZhao

Same Issue

liutaocode avatar Apr 02 '23 16:04 liutaocode

Same here.

tommy-qichang avatar Apr 12 '23 19:04 tommy-qichang

same here +1

cmstudyscode avatar Apr 19 '23 06:04 cmstudyscode

same here + 1

Rafa-zy avatar May 10 '23 03:05 Rafa-zy

same here + 1

xjwang4 avatar May 18 '23 08:05 xjwang4

Same here. Is there any guidance on how to proceed?

gugarosa avatar May 19 '23 17:05 gugarosa

Can you share your solution with me? Thanks. @cdj0311

young-chao avatar Jun 04 '23 09:06 young-chao

Any update on this? @cdj0311

macabdul9 avatar Jun 16 '23 13:06 macabdul9

The problem is caused by the checkpoint being saved on 6 GPUs but reloaded with 4 GPUs.

Try torch.load(..., map_location=torch.device('cpu'))

kxgong avatar Dec 22 '23 08:12 kxgong

Hello, your email has been received, thank you!

young-chao avatar Dec 22 '23 08:12 young-chao

same issue + 1

ybdesire avatar Feb 07 '24 02:02 ybdesire

Same here:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 7 but the current world size is 8. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

timpal0l avatar Apr 05 '24 14:04 timpal0l

We recommend that you use DeepSpeed universal checkpoint.
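
A rough sketch of what this recommendation might involve, under stated assumptions: the ZeRO checkpoint is first converted with the `ds_to_universal.py` script that ships with DeepSpeed (using its `--input_folder` and `--output_folder` arguments), and the new run then sets `"load_universal": true` in the `checkpoint` section of the DeepSpeed config. The helper names and paths below are hypothetical, and the script location and option names should be checked against the installed DeepSpeed version.

```python
from typing import Dict, List


def conversion_command(input_folder: str, output_folder: str) -> List[str]:
    """Build the argv for converting a ZeRO checkpoint to universal format.

    Assumption: ds_to_universal.py is reachable from the working directory and
    accepts --input_folder / --output_folder; verify against your install.
    """
    return [
        "python", "ds_to_universal.py",
        "--input_folder", input_folder,
        "--output_folder", output_folder,
    ]


def with_universal_loading(ds_config: Dict) -> Dict:
    """Return a copy of a DeepSpeed config with universal loading enabled.

    Assumption: the relevant key is checkpoint.load_universal.
    """
    cfg = dict(ds_config)
    cfg["checkpoint"] = {**cfg.get("checkpoint", {}), "load_universal": True}
    return cfg


# Hypothetical paths and base config, for illustration only.
cmd = conversion_command("ckpt/global_step1000", "ckpt/global_step1000_universal")
config = with_universal_loading({"zero_optimization": {"stage": 2}})
print(" ".join(cmd))
print(config)
```

The idea is to run the conversion once, outside the training job, and then resume from the converted folder with the new world size using the updated config.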

samadejacobs avatar Apr 05 '24 19:04 samadejacobs