ColossalAI
[BUG]: PyTorch single-node multi-GPU problem: ERROR: torch.distributed.elastic.multiprocessing.api:failed
🐛 Describe the bug
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The client socket has failed to connect to any network address of (i-0b9e876c, 57748). The IPv6 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known). The IPv4 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51863) of binary: /home/whong/anaconda3/envs/chatgpt/bin/python
Traceback (most recent call last):
File "/home/whong/anaconda3/envs/chatgpt/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./examples/train_reward_model.py FAILED
Failures:
[1]:
  time      : 2023-03-23_15:36:49
  host      : i-0B9E876C
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 51864)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2023-03-23_15:36:49
  host      : i-0B9E876C
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 51863)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Environment
No response
The port might have been occupied. Can you try running with a different port number?
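For instance, the rendezvous address and port can be pinned to something the node can actually resolve; a minimal sketch, where 127.0.0.1 and 29501 are placeholder values rather than anything taken from this thread:

```python
import os
import torch.distributed as dist

# RANK and WORLD_SIZE are set by torchrun; fall back to a single process otherwise.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

dist.init_process_group(
    backend="nccl",                        # nccl for multi-GPU training
    init_method="tcp://127.0.0.1:29501",   # placeholder loopback address and a free port
    rank=rank,
    world_size=world_size,
)
dist.destroy_process_group()
```

When launching with torchrun, the same effect can be achieved from the command line with --master_addr/--master_port, or with --standalone for a purely single-node run.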
Ok, I'll give it a try. Thank you.
Sorry, what do you mean by occupied port here?
The port number that you use to launch the processes.
I checked and found that none of the ports on the four-GPU machine were occupied. Why does this still happen? Even after changing the port, the same error is reported.
When running inside a Docker environment, can you append --network=host to your command?
same problem
+1
same problem @JThh
same problem. It seems that using a single node with a single trainer is fine, but when nproc_per_node > 1, I get the same error.
world_size = int(os.environ["WORLD_SIZE"])
mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)
This is my main function for starting distributed training. When calling spawn, it passes a process index in addition to args to the target function, in this case main_worker, which should be defined like this:
def main_worker(i, world_size, args):
Then set the device inside main_worker and move the model to it like this:
torch.cuda.set_device(i)
device = torch.device(f"cuda:{i}")
model = model.to(device)
I solved the problem this way; I hope it helps.
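For reference, here is a minimal self-contained sketch of that pattern; the model, rendezvous address/port, and argument handling are placeholders rather than the original training code:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn


def main_worker(i, world_size, args):
    # mp.spawn passes the process index `i` as the first argument;
    # on a single node it doubles as the local rank / GPU index.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder: single-node run
    os.environ.setdefault("MASTER_PORT", "29501")      # placeholder: any free port
    dist.init_process_group(backend="nccl", rank=i, world_size=world_size)

    torch.cuda.set_device(i)
    device = torch.device(f"cuda:{i}")

    model = nn.Linear(8, 8).to(device)                 # placeholder model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[i])

    # ... training loop goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    # Fall back to the number of visible GPUs when WORLD_SIZE is not set.
    world_size = int(os.environ.get("WORLD_SIZE", torch.cuda.device_count()))
    args = None                                        # placeholder for parsed arguments
    mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)
```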
+1
Thanks @Honee-W for sharing. I understand the issue better now.
model = model.to(torch.cuda.current_device())
should suffice. Would this be useful for you, @Youly172?
+1
+1
I also ran into this problem. How did you solve it?
(DGM4) root@autodl-container-602546be92-9be6991b:~/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main# sh train.sh
tcp://127.0.0.1:10031, ws:4, rank:0
tcp://127.0.0.1:10031, ws:4, rank:1
tcp://127.0.0.1:10031, ws:4, rank:2
tcp://127.0.0.1:10031, ws:4, rank:3
Namespace(checkpoint='ALBEF_4M.pth', config='configs/train.yaml', device='cuda', dist_backend='nccl', dist_url='tcp://127.0.0.1:10031', distributed=True, gpu=0, launcher='pytorch', log=True, log_num='20240729_231251', model_save_epoch=100, ngpus_per_node=4, output_dir='results', rank=0, resume=False, seed=777, text_encoder='bert-base-uncased', token_momentum=True, world_size=4)
{'train_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/train.json'], 'val_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/val.json'], 'bert_config': '/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/configs/config_bert.json', 'image_res': 256, 'vision_width': 768, 'embed_dim': 256, 'batch_size_train': 32, 'batch_size_val': 64, 'temp': 0.07, 'queue_size': 65536, 'momentum': 0.995, 'alpha': 0.4, 'max_words': 50, 'label_smoothing': 0.0, 'loss_MAC_wgt': 0.1, 'loss_BIC_wgt': 1, 'loss_bbox_wgt': 0.1, 'loss_giou_wgt': 0.1, 'loss_TMG_wgt': 1, 'loss_MLC_wgt': 1, 'optimizer': {'opt': 'adamW', 'lr': 2e-05, 'lr_img': 0.0001, 'weight_decay': 0.02}, 'schedular': {'sched': 'cosine', 'lr': 2e-05, 'epochs': 50, 'min_lr': 1e-06, 'decay_rate': 1, 'warmup_lr': 1e-06, 'warmup_epochs': 10, 'cooldown_epochs': 0}}
Creating dataset
Traceback (most recent call last):
File "train.py", line 557, in
-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/train.py", line 369, in main_worker
    tokenizer = BertTokenizerFast.from_pretrained(args.text_encoder)
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1672, in from_pretrained
    resolved_vocab_files[file_id] = cached_path(
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1329, in cached_path
    output_path = get_from_cache(
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1552, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
How can I solve this? I've been stuck on it for a long time.
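The ValueError above comes from transformers failing to download bert-base-uncased when BertTokenizerFast.from_pretrained(args.text_encoder) runs. A minimal sketch of one workaround, assuming the training node simply has no internet access: download the tokenizer once on a machine that can reach the Hugging Face Hub, then point text_encoder at the local copy (the directory below is a placeholder path):

```python
from transformers import BertTokenizerFast

# Run once on a machine with internet access: fetch and save the tokenizer files locally.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("/root/autodl-tmp/bert-base-uncased")  # placeholder directory

# On the training node, load from the local directory so nothing is downloaded.
tokenizer = BertTokenizerFast.from_pretrained("/root/autodl-tmp/bert-base-uncased")
```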