
[BUG]: PyTorch single-node multi-GPU problem: ERROR: torch.distributed.elastic.multiprocessing.api:failed

Open rabeisabigfool opened this issue 1 year ago • 17 comments

🐛 Describe the bug

File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group store, rank, world_size = next(rendezvous_iterator) File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout) File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store hostname, port, world_size, start_daemon, timeout, multi_tenant=True RuntimeError: The client socket has failed to connect to any network address of (i-0b9e876c, 57748). The IPv6 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known). The IPv4 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known). ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51863) of binary: /home/whong/anaconda3/envs/chatgpt/bin/python Traceback (most recent call last): File "/home/whong/anaconda3/envs/chatgpt/bin/torchrun", line 33, in sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')()) File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper return f(*args, **kwargs) File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run )(*cmd_args) File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent failures=result.failures, torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./examples/train_reward_model.py FAILED

Failures:
  [1]:
    time       : 2023-03-23_15:36:49
    host       : i-0B9E876C
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 51864)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]:
    time       : 2023-03-23_15:36:49
    host       : i-0B9E876C
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 51863)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
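
For reference, the "error_file : <N/A>" entries and the "To enable traceback" hint refer to torchelastic's error recording. A minimal, hedged sketch of how it is typically enabled in the training entry point (the function name main below is an illustration, not taken from ./examples/train_reward_model.py):

```python
# Hedged sketch: wrap the entry point with torchelastic's @record so that a
# child-process failure writes a full traceback instead of "error_file: <N/A>".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # training code launched by torchrun goes here

if __name__ == "__main__":
    main()
```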

Environment

No response

rabeisabigfool avatar Mar 23 '23 07:03 rabeisabigfool

The port might have been occupied. Can you try running with a different port number?

JThh avatar Mar 23 '23 09:03 JThh
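
As a hedged illustration of what an "occupied port" means here: the rendezvous port is the TCP port the rank-0 process listens on (for example, the value passed to torchrun via --master_port). The exact launch command is not shown in this thread, so the default port 29500 and the helper below are assumptions; the sketch simply checks whether a candidate port is already in use on the node before launching with it:

```python
# Hedged sketch: check whether a candidate rendezvous port is already taken on
# this node, so a free one can be passed to the launcher (e.g. --master_port).
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # connect_ex returns 0 when something is already listening on (host, port)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

print(port_in_use(29500))  # 29500 is the usual torch.distributed default
```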

The port might have been occupied. Can you try running with a different port number?

OK, I'll give it a try. Thank you.

rabeisabigfool avatar Mar 24 '23 01:03 rabeisabigfool

The port might have been occupied. Can you try running with a different port number?

Sorry, what do you mean by occupied port here?

rabeisabigfool avatar Mar 24 '23 06:03 rabeisabigfool

The port number with which you launch the processes.

JThh avatar Mar 24 '23 07:03 JThh

The port number with which you launch the processes.

I checked and found that the ports used by the four GPU processes were not occupied. Why is that? And even after changing the port, the same error is still reported.

rabeisabigfool avatar Mar 24 '23 07:03 rabeisabigfool

If you are running inside a Docker environment, can you add --network=host to your docker run command?

JThh avatar Mar 27 '23 06:03 JThh

same problem

scarydemon2 avatar Mar 31 '23 02:03 scarydemon2

+1

cauyxy avatar Mar 31 '23 13:03 cauyxy

same problem @JThh

akk-123 avatar Apr 02 '23 15:04 akk-123


Same problem. It seems that a single node with a single trainer works fine, but when nproc_per_node > 1, I get the same error.

Honee-W avatar Apr 03 '23 08:04 Honee-W

This is how I start distributed training in my main function:

world_size = int(os.environ["WORLD_SIZE"])
mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)

When calling spawn, it passes the process index to the target function as the first argument, in addition to args. In this case the target is main_worker, so it should be defined like this:

def main_worker(i, world_size, args):

Then set the device inside main_worker and move the model onto it:

torch.cuda.set_device(i)
device = torch.device(f"cuda:{i}")
model = model.to(device)

I solved this problem by doing so; hope it helps.

Honee-W avatar Apr 06 '23 01:04 Honee-W
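
For readers hitting the same error only when nproc_per_node > 1, here is a hedged, self-contained sketch of the pattern Honee-W describes. The placeholder model, the nccl backend, port 29500, and the use of 127.0.0.1 as the master address are assumptions for illustration, not taken from the ColossalAI examples; pinning the address to 127.0.0.1 also sidesteps the hostname-resolution failure ("Name or service not known") from the original log:

```python
# Minimal single-node multi-GPU sketch: one spawned process per GPU, each
# setting its own device before building/moving the model.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main_worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # avoid resolving the machine hostname
    os.environ.setdefault("MASTER_PORT", "29500")      # assumed free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)              # one GPU per process
    model = nn.Linear(8, 2).to(rank)         # placeholder model (assumption)
    model = DDP(model, device_ids=[rank])

    # ... training loop would go here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(main_worker, args=(world_size,), nprocs=world_size)
```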

+1

Youly172 avatar Apr 11 '23 15:04 Youly172

Thanks @Honee-W for sharing. I understand the issue better now.

model = model.to(torch.cuda.current_device()) would suffice. Would this be useful for you @Youly172?

JThh avatar Apr 14 '23 03:04 JThh
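
In context, that call would sit inside the per-process worker, roughly like this (hedged; nn.Linear stands in for whatever module the script actually builds):

```python
import torch
import torch.nn as nn

rank = 0                                       # in real code, the index passed in by mp.spawn
torch.cuda.set_device(rank)
model = nn.Linear(8, 2)                        # placeholder module (assumption)
model = model.to(torch.cuda.current_device())  # same GPU as set_device(rank)
```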

The email has been received. ~ Li Qiaoyan

Youly172 avatar Apr 14 '23 03:04 Youly172


+1

ifromeast avatar Apr 18 '23 11:04 ifromeast

+1

Ozawa333 avatar May 08 '23 14:05 Ozawa333



I've run into this problem too. How did you solve it?

ALLISWELL8 avatar Oct 13 '23 08:10 ALLISWELL8



(DGM4) root@autodl-container-602546be92-9be6991b:~/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main# sh train.sh
tcp://127.0.0.1:10031, ws:4, rank:0
tcp://127.0.0.1:10031, ws:4, rank:1
tcp://127.0.0.1:10031, ws:4, rank:2
tcp://127.0.0.1:10031, ws:4, rank:3


Namespace(checkpoint='ALBEF_4M.pth', config='configs/train.yaml', device='cuda', dist_backend='nccl', dist_url='tcp://127.0.0.1:10031', distributed=True, gpu=0, launcher='pytorch', log=True, log_num='20240729_231251', model_save_epoch=100, ngpus_per_node=4, output_dir='results', rank=0, resume=False, seed=777, text_encoder='bert-base-uncased', token_momentum=True, world_size=4)


{'train_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/train.json'], 'val_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/val.json'], 'bert_config': '/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/configs/config_bert.json', 'image_res': 256, 'vision_width': 768, 'embed_dim': 256, 'batch_size_train': 32, 'batch_size_val': 64, 'temp': 0.07, 'queue_size': 65536, 'momentum': 0.995, 'alpha': 0.4, 'max_words': 50, 'label_smoothing': 0.0, 'loss_MAC_wgt': 0.1, 'loss_BIC_wgt': 1, 'loss_bbox_wgt': 0.1, 'loss_giou_wgt': 0.1, 'loss_TMG_wgt': 1, 'loss_MLC_wgt': 1, 'optimizer': {'opt': 'adamW', 'lr': 2e-05, 'lr_img': 0.0001, 'weight_decay': 0.02}, 'schedular': {'sched': 'cosine', 'lr': 2e-05, 'epochs': 50, 'min_lr': 1e-06, 'decay_rate': 1, 'warmup_lr': 1e-06, 'warmup_epochs': 10, 'cooldown_epochs': 0}}


Creating dataset
Traceback (most recent call last):
  File "train.py", line 557, in <module>
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args, config))
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/train.py", line 369, in main_worker
    tokenizer = BertTokenizerFast.from_pretrained(args.text_encoder)
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1672, in from_pretrained
    resolved_vocab_files[file_id] = cached_path(
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1329, in cached_path
    output_path = get_from_cache(
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1552, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

How do I solve this? I've been stuck on it for a long time.

Yizhichaoai avatar Jul 29 '24 15:07 Yizhichaoai
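
Note that this last traceback is a different failure from the original rendezvous error: the spawned worker dies because BertTokenizerFast.from_pretrained(args.text_encoder) cannot reach the Hugging Face Hub. A hedged workaround sketch, assuming the model is downloaded once on a machine with internet access; the local directory path below is hypothetical:

```python
# Hedged sketch: load the tokenizer without hitting the network inside each
# spawned worker, either from a pre-downloaded directory or from the local cache.
from transformers import BertTokenizerFast

LOCAL_DIR = "/root/autodl-tmp/models/bert-base-uncased"  # hypothetical local copy

# Option 1: point directly at the pre-downloaded directory
tokenizer = BertTokenizerFast.from_pretrained(LOCAL_DIR)

# Option 2: keep the model name but forbid network lookups once it is cached
# tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased", local_files_only=True)
```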

