
Error during distributed training: RuntimeError: connect() timed out.

Open Imposingapple opened this issue 4 years ago • 3 comments

Hello, I read your guide "More pre-trained models" and wanted to use the 'book_review.txt' data you provide to fine-tune a very small GPT-2. I ran the following commands in order (I have 4 "machines", so I set world_size to 4, and I am running on the two GPUs that are free):

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
    --dataset_path dataset.pt --processes_num 8 --target lm

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --output_model_path models/output_model.bin \
    --config_path models/bert_base_config.json --learning_rate 1e-4 \
    --world_size 4 --gpu_ranks 2 3 \
    --embedding word_pos --encoder transformer --mask causal --target lm

At this point the following error is reported:

####################################################################################
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/haoping/UER-py/uer/trainer.py", line 399, in worker
    # Initialize multiprocessing distributed training environment.
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/haoping/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
KeyboardInterrupt
Traceback (most recent call last):
  File "pretrain.py", line 118, in <module>
    main()
  File "pretrain.py", line 114, in main
    trainer.train_and_validate(args)
  File "/home/haoping/UER-py/uer/trainer.py", line 58, in train_and_validate
    mp.spawn(worker, nprocs=args.ranks_num, args=(args.gpu_ranks, args, model), daemon=False)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/haoping/UER-py/uer/trainer.py", line 400, in worker
    dist.init_process_group(backend=args.backend,
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

Why does this happen, and how can I fix it? Looking forward to your reply, thanks!

Imposingapple, Jan 21 '21 10:01

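Hello. world_size should be the number of GPUs. For example, with two servers that each have 4 GPUs, use --world_size 8; on the first machine (the master node) use --gpu_ranks 0 1 2 3, and on the second machine use --gpu_ranks 4 5 6 7. I did not understand whether your "I have 4 machines" means four servers or four GPU cards. For more detailed questions you can add me on QQ or contact me at my QQ email: [email protected]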

zhezhaoa, Jan 21 '21 12:01
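
The numbering rule in the reply above (world_size is the total number of GPUs, and each machine takes a contiguous block of ranks starting from 0 on the master node) can be sketched in a few lines of Python. This is only an illustration: `assign_gpu_ranks` and `world_size_for` are hypothetical helper names, not part of UER-py, and the sketch assumes every machine has the same number of GPUs. One plausible reading of the timeout, not confirmed in this thread, is that with --world_size 4 --gpu_ranks 2 3 no process with rank 0 was ever started to host the rendezvous store.

```python
from typing import List


def assign_gpu_ranks(num_nodes: int, gpus_per_node: int) -> List[List[int]]:
    """Return the global ranks for each node, numbered contiguously from 0.

    Hypothetical helper for illustration; assumes every node has the
    same number of GPUs.
    """
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]


def world_size_for(ranks_per_node: List[List[int]]) -> int:
    """world_size is the total number of GPU processes across all nodes."""
    return sum(len(ranks) for ranks in ranks_per_node)


if __name__ == "__main__":
    # The example from the reply: two servers with 4 GPUs each.
    ranks = assign_gpu_ranks(num_nodes=2, gpus_per_node=4)
    print("--world_size", world_size_for(ranks))
    for node, node_ranks in enumerate(ranks):
        print(f"node {node}: --gpu_ranks", *node_ranks)
```

For the two-server example this yields ranks 0 to 3 for the master node and 4 to 7 for the second machine, matching the flags given in the reply.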

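Thanks for the answer! I meant a single machine with four GPU cards; training does run on one card. GPUs have been in short supply in our lab lately, so I have not had a chance to try multi-GPU again. Looking at the instructions, I see that my ranks did not start from 0 and that I had not set the cuda_is_visible setting. I will try again when I get the chance and see whether it works!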

Imposingapple, Jan 24 '21 02:01
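
For the single-machine case described above, a minimal sketch of the settings might look as follows. It assumes that the "cuda_is_visible" setting refers to the CUDA_VISIBLE_DEVICES environment variable, which restricts a process to the listed physical GPUs and renumbers them from 0. `single_node_settings` is a hypothetical helper, not part of UER-py, and whether these exact flags resolve the timeout was not confirmed in the thread.

```python
from typing import Dict, List, Tuple


def single_node_settings(physical_gpu_ids: List[int]) -> Tuple[Dict[str, str], List[str]]:
    """Build the environment and flags for training on a subset of one machine's GPUs.

    Hypothetical helper for illustration. Assumes CUDA_VISIBLE_DEVICES is used
    to select the physical GPUs, so the ranks passed on the command line
    always start from 0.
    """
    if not physical_gpu_ids:
        raise ValueError("at least one GPU id is required")
    if len(set(physical_gpu_ids)) != len(physical_gpu_ids):
        raise ValueError("GPU ids must be unique")

    # Only the listed physical GPUs are visible to the training process.
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in physical_gpu_ids)}

    # world_size counts the GPUs actually used, not all GPUs in the machine.
    n = len(physical_gpu_ids)
    flags = ["--world_size", str(n), "--gpu_ranks"] + [str(r) for r in range(n)]
    return env, flags


if __name__ == "__main__":
    # The two free GPUs from the original post: physical devices 2 and 3.
    env, flags = single_node_settings([2, 3])
    prefix = " ".join(f"{k}={v}" for k, v in env.items())
    print(prefix, "python3 pretrain.py ...", " ".join(flags))
```

The point of the design is that the physical device choice lives in the environment variable, while world_size and the ranks describe only the processes that actually take part in training.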

(Also, it seems you have not accepted my friend request yet, haha.)

Imposingapple, Jan 24 '21 02:01