
In distributed training, an error occurs when loading model checkpoints that were just saved, preventing training from resuming.

WhaleSpring opened this issue · 1 comment

Experimental environment: two Ubuntu GPU servers.
Experimental code source: https://github.com/OvJat/DeepSpeedTutorial.git

Fault description: I used engine.save() to save the model training state to the specified path and then used engine.load() to restore it. The error below was raised; the full fault output is provided. (Note: with single-machine DeepSpeed training this save/load cycle completes without error; it only fails in multi-machine training.)
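For context, here is a minimal sketch of the save/load cycle in question, written against DeepSpeed's public checkpoint API (save_checkpoint / load_checkpoint), which the tutorial's engine.save()/engine.load() presumably wrap. The model, config values, and paths are illustrative, not taken from the tutorial code:

```python
# Minimal sketch of the checkpoint save/load cycle (illustrative values).
import torch
import deepspeed

model = torch.nn.Linear(512, 512)
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},
}
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# save_checkpoint is a collective call: every rank must reach it.
engine.save_checkpoint("/tmp/checkpoints", tag="global_step11")

# load_checkpoint is likewise collective; in a multi-node run, every node
# must be able to read the same checkpoint files for it to succeed.
load_path, client_state = engine.load_checkpoint(
    "/tmp/checkpoints", tag="global_step11"
)
```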

Fault information:

8.139.254.37: [2024-07-07 22:27:26,103] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt...
8.139.254.37: [2024-07-07 22:27:26,105] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt.
8.139.254.37: [2024-07-07 22:27:26,106] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt...
8.139.254.37: [2024-07-07 22:27:26,108] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt.
8.139.254.37: [2024-07-07 22:27:26,109] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_optim_states.pt...
8.139.254.37: [2024-07-07 22:27:26,246] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_optim_states.pt.
8.139.254.37: [2024-07-07 22:27:26,246] [INFO] [engine.py:3018:_get_all_zero_checkpoint_state_dicts] successfully read 2 ZeRO state_dicts for rank 0
8.139.254.37: [2024-07-07 22:27:26,277] [INFO] [engine.py:2968:_load_zero_checkpoint] loading 2 zero partition checkpoints for rank 0
8.139.254.37: terminate called after throwing an instance of 'gloo::EnforceNotMet'
8.139.254.37:   what():  [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 1664 vs 1536
8.149.133.95: [rank1]: Traceback (most recent call last):
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 203, in <module>
8.149.133.95: [rank1]:     main()
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 199, in main
8.149.133.95: [rank1]:     train()
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 178, in train
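The Gloo failure `op.preamble.length <= op.nbytes. 1664 vs 1536` means the sender and receiver disagree on the size of a tensor exchanged while loading the ZeRO partition checkpoints, which is consistent with the two nodes reading different files from their own /tmp. One way to check this, sketched below with a hypothetical helper that is not part of the tutorial (it assumes torch.distributed is already initialized by the DeepSpeed launcher), is to compare the checkpoint directory listing across ranks before loading:

```python
# Hypothetical diagnostic: verify all ranks see identical checkpoint files.
# Mismatched names or sizes across ranks point to stale or divergent
# node-local copies under /tmp.
import os
import torch.distributed as dist

def check_checkpoint_consistency(path):
    """Gather (filename, size) pairs from every rank; compare on rank 0."""
    local = sorted(
        (f, os.path.getsize(os.path.join(path, f)))
        for f in os.listdir(path)
    )
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local)
    if dist.get_rank() == 0:
        for rank, listing in enumerate(gathered):
            if listing != gathered[0]:
                print(f"rank {rank} listing differs from rank 0:")
                print(f"  rank 0:    {gathered[0]}")
                print(f"  rank {rank}: {listing}")

# Example: check the step-11 directory from the logs above.
check_checkpoint_consistency("/tmp/checkpoints/global_step11")
```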


WhaleSpring · Jul 07 '24 14:07

@WhaleSpring, can you clarify two things to help with debugging? (A sketch of possible remedies for both follows the list.)

  1. Are the checkpoints saved to local disk? The load logs reference /tmp/, which is typically node-local storage.
  2. Is this run using gloo instead of nccl? The logs reference gloo.
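If the answers are "yes", the sketch below shows the usual follow-ups. The backend call is part of DeepSpeed's public API; the shared mount path is hypothetical, and the `use_node_local_storage` config flag is an assumption based on the documented `checkpoint` section of the DeepSpeed config:

```python
# Sketch of follow-ups for both questions; paths and values are assumptions.
import deepspeed

# (2) Prefer NCCL over Gloo for multi-node GPU training; this can be called
# before deepspeed.initialize().
deepspeed.init_distributed(dist_backend="nccl")

# (1a) Point checkpoints at storage visible to both nodes (hypothetical mount):
ckpt_dir = "/mnt/shared/checkpoints"

# (1b) Alternatively, if checkpoints must stay on node-local disk such as
# /tmp, enable node-local checkpointing in the DeepSpeed config
# (assumption: the documented use_node_local_storage flag applies here):
ds_config = {
    "train_batch_size": 16,
    "checkpoint": {"use_node_local_storage": True},
}
```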

tjruwase · Aug 03 '24 22:08

Closing for lack of response. Please re-open if needed.

tjruwase · Dec 11 '24 03:12