UER-py
RuntimeError: connect() timed out during distributed training
Hello,
I read your《更多预训练模型》(More Pretrained Models) documentation and wanted to fine-tune a very small GPT-2 on the book_review.txt data you provide.
I ran the following commands in sequence (we have 4 machines, so I set world_size to 4, running on the two idle GPUs):
python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --target lm
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin \
                    --config_path models/bert_base_config.json --learning_rate 1e-4 \
                    --world_size 4 --gpu_ranks 2 3 \
                    --embedding word_pos --encoder transformer --mask causal --target lm
At this point it reported the following error:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/haoping/UER-py/uer/trainer.py", line 399, in worker
    # Initialize multiprocessing distributed training environment.
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/haoping/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
KeyboardInterrupt
Traceback (most recent call last):
  File "pretrain.py", line 118, in

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/haoping/UER-py/uer/trainer.py", line 400, in worker
    dist.init_process_group(backend=args.backend,
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/haoping/anaconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
Why does this happen, and how can I fix it? Looking forward to your reply, thank you!
Hello. world_size should equal the total number of GPUs. For example, with two servers and 4 GPUs on each, use --world_size 8, with --gpu_ranks 0 1 2 3 on the first machine (the master node) and --gpu_ranks 4 5 6 7 on the second. I'm not sure whether by "4 machines" you mean four servers or four GPU cards. For more detailed questions, feel free to contact me directly on QQ or by QQ mail: [email protected]
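For concreteness, here is a sketch of that two-server example as full commands. The master address tcp://node-0-addr:12345 is a placeholder, and --master_ip is the flag the UER-py README uses to point worker nodes at the master node; check your version of pretrain.py to confirm it.

# On the first server (master node), ranks 0-3:
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 \
                    --master_ip tcp://node-0-addr:12345 \
                    --embedding word_pos --encoder transformer --mask causal --target lm

# On the second server, ranks 4-7 (same command, different ranks):
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 4 5 6 7 \
                    --master_ip tcp://node-0-addr:12345 \
                    --embedding word_pos --encoder transformer --mask causal --target lm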
Thanks for the explanation! I meant one machine with four cards; it does run on a single card. GPUs have been scarce in our lab lately, so I haven't had a chance to try multi-GPU again. Rereading the instructions, I see that my gpu_ranks didn't start from 0 and I never set CUDA_VISIBLE_DEVICES. I'll give it a try when I get the chance!
(Also, it seems my friend request to you was never accepted, haha.)
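For the single-machine case described above (four cards, two of them free), a minimal sketch of the corrected command: hide the busy cards with CUDA_VISIBLE_DEVICES, set --world_size to the number of GPUs actually used, and number --gpu_ranks from 0. The indices 2,3 are placeholders for whichever cards happen to be idle.

# Use only physical GPUs 2 and 3; inside the process they appear as devices 0 and 1:
CUDA_VISIBLE_DEVICES=2,3 python3 pretrain.py --dataset_path dataset.pt \
    --vocab_path models/google_zh_vocab.txt --output_model_path models/output_model.bin \
    --config_path models/bert_base_config.json --learning_rate 1e-4 \
    --world_size 2 --gpu_ranks 0 1 \
    --embedding word_pos --encoder transformer --mask causal --target lm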