ColossalAI
[BUG]: PyTorch single-node multi-GPU problem: ERROR: torch.distributed.elastic.multiprocessing.api:failed
🐛 Describe the bug
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The client socket has failed to connect to any network address of (i-0b9e876c, 57748). The IPv6 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known). The IPv4 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51863) of binary: /home/whong/anaconda3/envs/chatgpt/bin/python
Traceback (most recent call last):
File "/home/whong/anaconda3/envs/chatgpt/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./examples/train_reward_model.py FAILED
Failures:
[1]:
  time      : 2023-03-23_15:36:49
  host      : i-0B9E876C
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 51864)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2023-03-23_15:36:49
  host      : i-0B9E876C
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 51863)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Environment
No response
The port might have been occupied. Can you try running with a different port number?
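For instance, the rendezvous address and port can be pinned to something the node can actually resolve; a minimal sketch, where 127.0.0.1 and 29501 are placeholder values rather than anything taken from this thread:

```python
import os
import torch.distributed as dist

# RANK and WORLD_SIZE are set by torchrun; fall back to a single process otherwise.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

dist.init_process_group(
    backend="nccl",                        # nccl for multi-GPU training
    init_method="tcp://127.0.0.1:29501",   # placeholder loopback address and a free port
    rank=rank,
    world_size=world_size,
)
dist.destroy_process_group()
```

When launching with torchrun, the same effect can be achieved from the command line with --master_addr/--master_port, or with --standalone for a purely single-node run.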
Ok, I'll give it a try. Thank you.
Sorry, what do you mean by occupied port here?
The port number that you use to launch the processes.
I checked and found that none of the ports on the four-GPU machine were occupied. Why does this still happen? Even after changing the port, the same error is reported.
When running inside a Docker environment, can you append --network=host to your command?
same problem
+1
same problem @JThh
same problem. It seems that using a single node with a single trainer is fine, but when nproc_per_node > 1, I get the same error.
world_size = int(os.environ["WORLD_SIZE"])
mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)
This is my main function for starting distributed training. When calling spawn, it passes a process index in addition to args to the target function, in this case main_worker, which should be defined like this:
def main_worker(i, world_size, args):
Then set the device inside main_worker and move the model to it like this:
torch.cuda.set_device(i)
device = torch.device(f"cuda:{i}")
model = model.to(device)
I solved the problem this way; I hope it helps.
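For reference, here is a minimal self-contained sketch of that pattern; the model, rendezvous address/port, and argument handling are placeholders rather than the original training code:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn


def main_worker(i, world_size, args):
    # mp.spawn passes the process index `i` as the first argument;
    # on a single node it doubles as the local rank / GPU index.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder: single-node run
    os.environ.setdefault("MASTER_PORT", "29501")      # placeholder: any free port
    dist.init_process_group(backend="nccl", rank=i, world_size=world_size)

    torch.cuda.set_device(i)
    device = torch.device(f"cuda:{i}")

    model = nn.Linear(8, 8).to(device)                 # placeholder model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[i])

    # ... training loop goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    # Fall back to the number of visible GPUs when WORLD_SIZE is not set.
    world_size = int(os.environ.get("WORLD_SIZE", torch.cuda.device_count()))
    args = None                                        # placeholder for parsed arguments
    mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)
```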
+1
Thanks @Honee-W for sharing. I understand the issue better now.
model = model.to(torch.cuda.current_device())
should suffice. Would this be useful for you, @Youly172?
+1
+1
I also ran into this problem. How did you solve it?
(DGM4) root@autodl-container-602546be92-9be6991b:~/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main# sh train.sh
tcp://127.0.0.1:10031, ws:4, rank:0
tcp://127.0.0.1:10031, ws:4, rank:1
tcp://127.0.0.1:10031, ws:4, rank:2
tcp://127.0.0.1:10031, ws:4, rank:3
Namespace(checkpoint='ALBEF_4M.pth', config='configs/train.yaml', device='cuda', dist_backend='nccl', dist_url='tcp://127.0.0.1:10031', distributed=True, gpu=0, launcher='pytorch', log=True, log_num='20240729_231251', model_save_epoch=100, ngpus_per_node=4, output_dir='results', rank=0, resume=False, seed=777, text_encoder='bert-base-uncased', token_momentum=True, world_size=4)
{'train_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/train.json'], 'val_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/val.json'], 'bert_config': '/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/configs/config_bert.json', 'image_res': 256, 'vision_width': 768, 'embed_dim': 256, 'batch_size_train': 32, 'batch_size_val': 64, 'temp': 0.07, 'queue_size': 65536, 'momentum': 0.995, 'alpha': 0.4, 'max_words': 50, 'label_smoothing': 0.0, 'loss_MAC_wgt': 0.1, 'loss_BIC_wgt': 1, 'loss_bbox_wgt': 0.1, 'loss_giou_wgt': 0.1, 'loss_TMG_wgt': 1, 'loss_MLC_wgt': 1, 'optimizer': {'opt': 'adamW', 'lr': 2e-05, 'lr_img': 0.0001, 'weight_decay': 0.02}, 'schedular': {'sched': 'cosine', 'lr': 2e-05, 'epochs': 50, 'min_lr': 1e-06, 'decay_rate': 1, 'warmup_lr': 1e-06, 'warmup_epochs': 10, 'cooldown_epochs': 0}}
Creating dataset
Traceback (most recent call last):
File "train.py", line 557, in
-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/train.py", line 369, in main_worker
    tokenizer = BertTokenizerFast.from_pretrained(args.text_encoder)
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1672, in from_pretrained
    resolved_vocab_files[file_id] = cached_path(
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1329, in cached_path
    output_path = get_from_cache(
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1552, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
How can I solve this? I've been stuck on it for a long time.
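The ValueError above comes from transformers failing to download bert-base-uncased when BertTokenizerFast.from_pretrained(args.text_encoder) runs. A minimal sketch of one workaround, assuming the training node simply has no internet access: download the tokenizer once on a machine that can reach the Hugging Face Hub, then point text_encoder at the local copy (the directory below is a placeholder path):

```python
from transformers import BertTokenizerFast

# Run once on a machine with internet access: fetch and save the tokenizer files locally.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("/root/autodl-tmp/bert-base-uncased")  # placeholder directory

# On the training node, load from the local directory so nothing is downloaded.
tokenizer = BertTokenizerFast.from_pretrained("/root/autodl-tmp/bert-base-uncased")
```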