DeepSpeed
Multi-node training reports "stop_waiting response required" and "connection reset by peer"
Describe the bug
I would like to use remote machines in the cloud for fine-tuning. I am using a hostfile and have configured SSH for passwordless login.
Using the command deepspeed --hostfile=myHostfile --master_addr 178.116.84.30 --master_port 10700 run_clm.py --deepspeed ds_config_stage3.json ... (further arguments)
produces the following output:
deepspeed --hostfile=myHostfile --master_addr 178.116.84.30 --master_port 10700 run_clm.py --deepspeed ds_config_stage3.json --model_name_or_path EleutherAI/gpt-j-6B --train_file train.txt --validation_file validation.txt --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned2 --num_train_epochs 4 --eval_steps 4 --gradient_accumulation_steps 32 --per_device_train_batch_size 8 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 10 --save_steps 48 --save_strategy steps --tokenizer_name gpt2 --load_best_model_at_end=True --block_size=2048
[2023-03-08 23:09:03,464] [INFO] [runner.py:549:main] cmd = /home/max/anaconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJ2YXN0YWkiOiBbMCwgMV19 --master_addr=178.116.84.30 --master_port=10700 --enable_each_rank_log=None run_clm.py --deepspeed ds_config_stage3.json --model_name_or_path EleutherAI/gpt-j-6B --train_file train.txt --validation_file validation.txt --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned2 --num_train_epochs 4 --eval_steps 4 --gradient_accumulation_steps 32 --per_device_train_batch_size 8 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 10 --save_steps 48 --save_strategy steps --tokenizer_name gpt2 --load_best_model_at_end=True --block_size=2048
[2023-03-08 23:09:04,879] [INFO] [launch.py:142:main] WORLD INFO DICT: {'vastai': [0, 1]}
[2023-03-08 23:09:04,879] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-03-08 23:09:04,879] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'vastai': [0, 1]})
[2023-03-08 23:09:04,879] [INFO] [launch.py:162:main] dist_world_size=2
[2023-03-08 23:09:04,879] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-03-08 23:09:09,759] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 625, in
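The --world_info value in the launcher command above looks like base64-encoded JSON, so decoding it is a quick way to check which hosts and GPU slots the launcher resolved. A minimal sketch, assuming URL-safe base64 over JSON; the helper name decode_world_info is hypothetical and not part of DeepSpeed:

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a --world_info value, assuming URL-safe base64 over JSON."""
    return json.loads(base64.urlsafe_b64decode(encoded))


# Value copied from the runner.py log line above.
world_info = decode_world_info("eyJ2YXN0YWkiOiBbMCwgMV19")
print(world_info)
print("nodes:", len(world_info))
print("processes:", sum(len(gpus) for gpus in world_info.values()))
```

For this log the decoded value has a single host entry, vastai, with GPU indices 0 and 1, which agrees with the WORLD INFO DICT, nnodes=1 and dist_world_size=2 lines.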
To Reproduce
Use the following files (IP address and port need to be changed):

MyHostfile:
vastai slots=2

ssh config file:
Host vastai
    Hostname 178.116.84.30
    User root
    Port 10700
Steps to reproduce the behavior: see above
Expected behavior
Fine-tuning should start.
Logging into the remote machine with "ssh vastai" works fine and does not prompt for a password.
ds_report output
(base) max@max-5824:/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/$ ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/max/anaconda3/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0
deepspeed install path ........... ['/home/max/anaconda3/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.1+867da307, 867da307, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU: 2 x RTX 3090 on remote machine
- Python: 3.9
- Remote machine is somewhere in the cloud
Docker context
No Docker image
I ran into the same problem. Adding "localhost" to the hostfile solved it for me:
localhost slots=1
vastai slots=2
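To make the difference between the two hostfiles concrete, here is a minimal sketch that parses lines of the form "<host> slots=<n>". The function parse_hostfile is hypothetical and is not DeepSpeed's own parser; it only shows that the original hostfile describes one node while the suggested one describes two.

```python
def parse_hostfile(text: str) -> dict:
    """Parse '<host> slots=<n>' lines into {host: slot_count} (illustrative only)."""
    resources = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, slots = line.split()
        key, _, count = slots.partition("=")
        if key != "slots":
            raise ValueError(f"unexpected hostfile entry: {line!r}")
        resources[host] = int(count)
    return resources


original = parse_hostfile("vastai slots=2")
suggested = parse_hostfile("localhost slots=1\nvastai slots=2")
print(len(original), "node(s):", original)
print(len(suggested), "node(s):", suggested)
```

Whether the launcher handles a one-entry hostfile as a purely local run is an assumption here, but it would be consistent with the nnodes=1 line in the log from the original report.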
@maxmaier59 did the suggestion from @shisi-cc fix the problem?