UER-py
RuntimeError: connect() timed out during distributed training
Hello,
I read your《更多预训练模型》(More Pretrained Models) documentation and wanted to fine-tune a very small GPT-2 on the book_review.txt data you provide.
I ran the following commands in sequence (we have 4 machines, so I set world_size to 4, running on the two idle GPUs):
python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --target lm
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin \
                    --config_path models/bert_base_config.json --learning_rate 1e-4 \
                    --world_size 4 --gpu_ranks 2 3 \
                    --embedding word_pos --encoder transformer --mask causal --target lm
At this point it reported the following error:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/haoping/UER-py/uer/trainer.py", line 399, in worker
    # Initialize multiprocessing distributed training environment.
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/haoping/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
KeyboardInterrupt
Traceback (most recent call last):
  File "pretrain.py", line 118, in

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/haoping/UER-py/uer/trainer.py", line 400, in worker
    dist.init_process_group(backend=args.backend,
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
Why does this happen, and how can I fix it? Looking forward to your reply, thank you!
Hello. world_size should equal the total number of GPUs. For example, with two servers and 4 GPUs on each, use --world_size 8, with --gpu_ranks 0 1 2 3 on the first machine (the master node) and --gpu_ranks 4 5 6 7 on the second. I'm not sure whether by "4 machines" you mean four servers or four GPU cards. For more detailed questions, feel free to contact me directly on QQ or by QQ mail: [email protected]
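For concreteness, here is a sketch of that two-server example as full commands. The master address tcp://node-0-addr:12345 is a placeholder, and --master_ip is the flag the UER-py README uses to point worker nodes at the master node; check your version of pretrain.py to confirm it.

# On the first server (master node), ranks 0-3:
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 \
                    --master_ip tcp://node-0-addr:12345 \
                    --embedding word_pos --encoder transformer --mask causal --target lm

# On the second server, ranks 4-7 (same command, different ranks):
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 4 5 6 7 \
                    --master_ip tcp://node-0-addr:12345 \
                    --embedding word_pos --encoder transformer --mask causal --target lm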
Thanks for the explanation! I meant one machine with four cards; it does run on a single card. GPUs have been scarce in our lab lately, so I haven't had a chance to try multi-GPU again. Rereading the instructions, I see that my gpu_ranks didn't start from 0 and I never set CUDA_VISIBLE_DEVICES. I'll give it a try when I get the chance!
(Also, it seems my friend request to you was never accepted, haha.)
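For the single-machine case described above (four cards, two of them free), a minimal sketch of the corrected command: hide the busy cards with CUDA_VISIBLE_DEVICES, set --world_size to the number of GPUs actually used, and number --gpu_ranks from 0. The indices 2,3 are placeholders for whichever cards happen to be idle.

# Use only physical GPUs 2 and 3; inside the process they appear as devices 0 and 1:
CUDA_VISIBLE_DEVICES=2,3 python3 pretrain.py --dataset_path dataset.pt \
    --vocab_path models/google_zh_vocab.txt --output_model_path models/output_model.bin \
    --config_path models/bert_base_config.json --learning_rate 1e-4 \
    --world_size 2 --gpu_ranks 0 1 \
    --embedding word_pos --encoder transformer --mask causal --target lm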