DeepSpeed
[BUG] It took almost 1 hour to reach: Initializing TorchBackend in DeepSpeed with backend nccl
Please help me eliminate this time-consuming step, thanks.
Me too. Have you solved this problem?
me too
Hi @Modas-Li, @alphanlp, and @NicholasYoungAI, can you provide more information about your system and setup? Thanks
me too
Is your machine connected to the internet? I also got stuck in an offline environment at first, but once the machine was connected to the internet it passed quickly.
The server I'm using is unable to connect to the internet. Can I configure DeepSpeed to prevent the offline environment from slowing down the NCCL initialization process?
I think the deepspeed.init_distributed API causes this problem. Find where it is called and try setting the timeout to 10 seconds.
https://deepspeed.readthedocs.io/en/latest/initialize.html#deepspeed.init_distributed
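For reference, here is a minimal sketch of what that would look like, assuming you call deepspeed.init_distributed yourself before deepspeed.initialize. The timeout argument is documented at the link above (its default is 30 minutes); a short value like 10 seconds mainly makes a stuck initialization fail fast with an error instead of hanging silently:

```python
# Minimal sketch: pass a shorter timeout to deepspeed.init_distributed.
# The 10-second value just mirrors the suggestion above; tune it for your setup.
from datetime import timedelta

import deepspeed

deepspeed.init_distributed(
    dist_backend="nccl",            # same backend as in the log output
    timeout=timedelta(seconds=10),  # default is timedelta(seconds=1800)
    verbose=True,
)
```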
I have the same problem when I start DeepSpeed-Chat with bash training/step2_reward_model_finetuning/training_scripts\single_node\run.sh
Here are the details (it wasted more than about 6 hours of my time and cannot go on):
[2023-07-13 14:51:33,751] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=1,2,3: setting --include=localhost:1,2,3
[2023-07-13 14:51:33,954] [INFO] [runner.py:541:main] cmd = /home/zhangchang/.conda/envs/chatdeep/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/rm-static --model_name_or_path THUDM/chatglm-6b --num_padding_at_beginning 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --disable_dropout --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 3 --deepspeed --output_dir ./output2
[2023-07-13 14:51:35,266] [INFO] [launch.py:222:main] 0 NCCL_IBEXT_DISABLE=1
[2023-07-13 14:51:35,267] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=1
[2023-07-13 14:51:35,267] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [1, 2, 3]}
[2023-07-13 14:51:35,267] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=3, node_rank=0
[2023-07-13 14:51:35,267] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2023-07-13 14:51:35,267] [INFO] [launch.py:247:main] dist_world_size=3
[2023-07-13 14:51:35,267] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=1,2,3
[2023-07-13 14:51:37,178] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Can you please try collecting top results using py-spy while this issue is happening? This should help with understanding what the process is executing during this time. It would also be nice if you could generate a flame graph too.
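In case it is useful, here is a rough sketch of driving py-spy from Python while the process is stuck; the same two commands can also be run directly from a shell. TRAINING_PID is a hypothetical placeholder for the PID of the hanging rank (take it from top or the launcher log), and py-spy is an extra assumption here, installed separately with pip install py-spy:

```python
# Rough sketch: capture a stack dump and a flame graph of the stuck process
# with py-spy. Attaching may require root or SYS_PTRACE privileges.
import subprocess

TRAINING_PID = 12345  # hypothetical PID of the hanging training process

# One-off dump of every thread's current Python stack.
subprocess.run(["py-spy", "dump", "--pid", str(TRAINING_PID)], check=True)

# Sample for 60 seconds and write a flame graph to profile.svg.
subprocess.run(
    ["py-spy", "record", "--pid", str(TRAINING_PID),
     "--duration", "60", "--output", "profile.svg"],
    check=True,
)
```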
Hi all, NCCL and an internet connection are required to run DeepSpeed. We do not support running without these. Closing this issue. Thanks. https://github.com/microsoft/DeepSpeed/issues/4104
This problem occurred when I was using transformers 4.25, but it disappeared when I upgraded the transformers library to 4.26.