
[BUG] Can't load checkpoint in multi-node training without a shared filesystem, even when the multi-node setup config remains the same

Open · pacman100 opened this issue on Sep 13, 2022 · 3 comments

Describe the bug

  1. Background: using DeepSpeed (ZeRO-1) for multi-node training and saving the optimizer states so that training can be resumed.
  2. save_checkpoint only saves the partitioned optimizer state that lives on each machine. With 32 GPUs (4 nodes), each node saves only 8 partitions; the 1st node saves bf16_zero_pp_rank_(0-7)_mp_rank_00_optim_states.pt. When load_state is called to resume DeepSpeed training, the 1st node then reports: The following zero checkpoints paths are missing: ['./output/test/step_5/pytorch_model/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt', './output/test/step_5/pytorch_model/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt']
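
To make the failure mode concrete, here is a minimal sketch (not DeepSpeed's actual code) of the kind of visibility check that fails on node 0. It assumes the shard naming from the error above and that each node wrote only the shards for its own 8 local ranks.

```python
import os

# Illustrative only: reproduce the kind of check that fails on node 0.
# Assumes the bf16 ZeRO-1 shard naming from the error message above and
# that each node wrote only the shards for its own 8 local ranks.
ckpt_dir = "./output/test/step_5/pytorch_model"
world_size = 32  # 4 nodes x 8 GPUs

expected = [
    os.path.join(ckpt_dir, f"bf16_zero_pp_rank_{rank}_mp_rank_00_optim_states.pt")
    for rank in range(world_size)
]
missing = [p for p in expected if not os.path.isfile(p)]

# On node 0 (ranks 0-7) without a shared filesystem, `missing` contains the
# 24 shards written by the other three nodes, matching the error above.
print(f"{len(missing)} of {world_size} optimizer shards are not visible locally")
```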

Expected behavior: the checkpoint should load when the multi-node setup is the same, even without a shared filesystem.

Launcher context: 🤗 Accelerate integration of DeepSpeed

Additional context: https://discuss.huggingface.co/t/questions-about-deepspeed-resume-training/22765 and https://discuss.huggingface.co/t/deepspeed-resume-training-from-saved-states/22768

pacman100 avatar Sep 13 '22 05:09 pacman100

@tjruwase

pacman100 avatar Sep 13 '22 05:09 pacman100

@pacman100, can you please confirm what you observe with this issue? Does the run crash or simply hang? I am trying to match it with my own observations. Thanks!

tjruwase avatar Sep 13 '22 22:09 tjruwase

Pinging @cyk1337 to provide more context, as they are the one experiencing this issue.

pacman100 avatar Sep 14 '22 05:09 pacman100

Hi, I tried to resume DeepSpeed multi-node training (using ZeRO) without a shared filesystem. The optimizer states were saved separately on each node, so resuming requires manually gathering them from all of the nodes. Is there any solution to this? Thanks!

cyk1337 avatar Oct 21 '22 04:10 cyk1337
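
For reference, a rough sketch of the manual gathering described in the comment above, under the assumption that the nodes can reach each other over SSH/rsync. The host names and the loop below are placeholders for illustration, not anything prescribed by DeepSpeed.

```python
import subprocess

# Hypothetical workaround: pull the rank-local optimizer shards from the
# other nodes so that every node sees all 32 *_optim_states.pt files
# before load_state is called. Run this on node0 (and the equivalent on
# the other nodes); passwordless SSH and rsync are assumed.
ckpt_dir = "./output/test/step_5/pytorch_model/"
other_nodes = ["node1", "node2", "node3"]  # placeholder host names

for host in other_nodes:
    # Sync the remote checkpoint directory contents into the same local path.
    subprocess.run(
        ["rsync", "-av", f"{host}:{ckpt_dir}", ckpt_dir],
        check=True,
    )
```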

@cyk1337 The PR that was created for this bug allows you to add use_node_local_storage=True to the checkpointing section of the ds_config. With this flag, DeepSpeed will save the checkpoints associated with the local ranks on each node and reload them without the need for manual intervention.

jomayeri avatar Oct 21 '22 19:10 jomayeri
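
For illustration, a minimal ds_config sketch with this flag. The checkpoint section is the relevant part; the other entries are placeholder values for a bf16 ZeRO-1 setup like the one in this issue, not a prescribed configuration.

```python
import json

# Sketch of a DeepSpeed config that enables node-local checkpoint storage.
# Only the "checkpoint" section matters for this issue; the rest are
# placeholders for a bf16 ZeRO-1 run.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "train_micro_batch_size_per_gpu": "auto",
    "checkpoint": {
        # Save and reload checkpoints from each node's local storage,
        # so no shared filesystem is needed to resume.
        "use_node_local_storage": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Assuming this is the config file passed to the launcher (or to the 🤗 Accelerate integration), save_checkpoint and load_checkpoint should then read and write the optimizer shards on each node's local disk.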