DeepSpeed
[BUG] It took almost 1 hour to reach: Initializing TorchBackend in DeepSpeed with backend nccl
Please help me eliminate this time-consuming step, thanks.
Me too. Have you solved this problem?
me too
Hi @Modas-Li, @alphanlp, and @NicholasYoungAI, can you provide more information about your system and setup? Thanks
me too
Is your machine connected to the internet? I also got stuck in an offline environment at first, but once the machine was connected to the internet it passed quickly.
The server I'm using is unable to connect to the internet. Can I configure DeepSpeed to prevent the offline environment from slowing down the NCCL initialization process?
I think the deepspeed.init_distributed API causes this problem. Find where it is called and try setting the timeout to 10 seconds.
https://deepspeed.readthedocs.io/en/latest/initialize.html#deepspeed.init_distributed
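For reference, here is a minimal sketch of what that would look like, assuming you call deepspeed.init_distributed yourself before deepspeed.initialize. The timeout argument is documented at the link above (its default is 30 minutes); a short value like 10 seconds mainly makes a stuck initialization fail fast with an error instead of hanging silently:

```python
# Minimal sketch: pass a shorter timeout to deepspeed.init_distributed.
# The 10-second value just mirrors the suggestion above; tune it for your setup.
from datetime import timedelta

import deepspeed

deepspeed.init_distributed(
    dist_backend="nccl",            # same backend as in the log output
    timeout=timedelta(seconds=10),  # default is timedelta(seconds=1800)
    verbose=True,
)
```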
I have the same problem when I start DeepSpeed-Chat with bash training/step2_reward_model_finetuning/training_scripts\single_node\run.sh
Here are the details (it wasted more than about 6 hours of my time and cannot go on):
[2023-07-13 14:51:33,751] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=1,2,3: setting --include=localhost:1,2,3
[2023-07-13 14:51:33,954] [INFO] [runner.py:541:main] cmd = /home/zhangchang/.conda/envs/chatdeep/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/rm-static --model_name_or_path THUDM/chatglm-6b --num_padding_at_beginning 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --disable_dropout --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 3 --deepspeed --output_dir ./output2
[2023-07-13 14:51:35,266] [INFO] [launch.py:222:main] 0 NCCL_IBEXT_DISABLE=1
[2023-07-13 14:51:35,267] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=1
[2023-07-13 14:51:35,267] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [1, 2, 3]}
[2023-07-13 14:51:35,267] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=3, node_rank=0
[2023-07-13 14:51:35,267] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2023-07-13 14:51:35,267] [INFO] [launch.py:247:main] dist_world_size=3
[2023-07-13 14:51:35,267] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=1,2,3
[2023-07-13 14:51:37,178] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Can you please try collecting top results using py-spy while this issue is happening? This should help with understanding what the process is executing during this time. It would also be nice if you could generate a flame graph too.
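In case it is useful, here is a rough sketch of driving py-spy from Python while the process is stuck; the same two commands can also be run directly from a shell. TRAINING_PID is a hypothetical placeholder for the PID of the hanging rank (take it from top or the launcher log), and py-spy is an extra assumption here, installed separately with pip install py-spy:

```python
# Rough sketch: capture a stack dump and a flame graph of the stuck process
# with py-spy. Attaching may require root or SYS_PTRACE privileges.
import subprocess

TRAINING_PID = 12345  # hypothetical PID of the hanging training process

# One-off dump of every thread's current Python stack.
subprocess.run(["py-spy", "dump", "--pid", str(TRAINING_PID)], check=True)

# Sample for 60 seconds and write a flame graph to profile.svg.
subprocess.run(
    ["py-spy", "record", "--pid", str(TRAINING_PID),
     "--duration", "60", "--output", "profile.svg"],
    check=True,
)
```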
Hi all, NCCL and an internet connection are required to run DeepSpeed. We do not support running without these. Closing this issue. Thanks. https://github.com/microsoft/DeepSpeed/issues/4104
This problem occurred when I was using transformers 4.25, but it disappeared when I upgraded the transformers library to 4.26.