
Default process group has not been initialized, please make sure to call init_process_group

Open · ustcxiexk opened this issue 2 years ago · 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Has anyone run into this problem? I hit it while running the official code without any modifications, and the solutions I found online don't work.

File "/home/work/LLM/chatglm/ptuning/main.py", line 431, in main() File "/home/work/LLM/chatglm/ptuning/main.py", line 370, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/home/work/LLM/chatglm/ptuning/trainer.py", line 1635, in train return inner_training_loop( File "/home/work/LLM/chatglm/ptuning/trainer.py", line 1722, in _inner_training_loop model = self._wrap_model(self.model_wrapped) File "/home/work/LLM/chatglm/ptuning/trainer.py", line 1547, in _wrap_model model = nn.parallel.DistributedDataParallel( File "/home/work/miniconda3/envs/chatglm/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 625, in init self.process_group = _get_default_group() File "/home/work/miniconda3/envs/chatglm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 697, in _get_default_group raise RuntimeError( RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Expected Behavior

No response

Steps To Reproduce

Download the latest ChatGLM code from the official repo, then run `cd ptuning` and `sh train.sh`.

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

ustcxiexk · Jun 02 '23 02:06

Launch it in distributed mode with torchrun.
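For example (hypothetical flags, adjust to your setup): if train.sh currently runs `python3 main.py ...`, launching the same script as `torchrun --nproc_per_node=<num_gpus> main.py ...` lets torchrun spawn the workers and set the environment variables that process-group initialization relies on.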

Youggls · Jun 02 '23 03:06

> Launch it in distributed mode with torchrun.

Thanks for the reply! Could you explain the steps in a bit more detail? Do I first need to modify main.py for distributed training, then run sh train.sh, and then launch it with torchrun? Thanks!!

ustcxiexk · Jun 02 '23 03:06

> Launch it in distributed mode with torchrun.

> Thanks for the reply! Could you explain the steps in a bit more detail? Do I first need to modify main.py for distributed training, then run sh train.sh, and then launch it with torchrun? Thanks!!

https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/trainer.py#L1532 Looking at the official default code, DDP is only started when local_rank != -1 in the training arguments. Some setting on your side is probably affecting the training arguments. If you don't need multi-GPU training, try manually passing the command-line argument `--local_rank -1`.
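A simplified sketch of the branch being referenced (paraphrased, not a verbatim copy of ptuning/trainer.py):

```python
import torch.nn as nn


def wrap_model(model, local_rank):
    # DDP is only engaged when a distributed run was requested (local_rank != -1).
    # Constructing it requires the default process group to already be initialized,
    # which is what launching via torchrun takes care of.
    if local_rank != -1:
        model = nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank], output_device=local_rank
        )
    return model
```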

Youggls · Jun 02 '23 06:06

@ustcxiexk Try this: https://github.com/THUDM/ChatGLM-6B/pull/1173/files

Barbery · Jun 02 '23 06:06

> @ustcxiexk Try this: https://github.com/THUDM/ChatGLM-6B/pull/1173/files

Thanks! It works now, awesome!

ustcxiexk · Jun 03 '23 01:06

> Launch it in distributed mode with torchrun.

> Thanks for the reply! Could you explain the steps in a bit more detail? Do I first need to modify main.py for distributed training, then run sh train.sh, and then launch it with torchrun? Thanks!!

> https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/trainer.py#L1532 Looking at the official default code, DDP is only started when local_rank != -1 in the training arguments. Some setting on your side is probably affecting the training arguments. If you don't need multi-GPU training, try manually passing the command-line argument `--local_rank -1`.

Thanks for the reply. I got it working with the approach the commenter below provided. Training is very slow, though; it has only reached the halfway point since yesterday afternoon. I'll try this method once it finishes.

ustcxiexk · Jun 03 '23 01:06

> @ustcxiexk Try this: https://github.com/THUDM/ChatGLM-6B/pull/1173/files

> Thanks! It works now, awesome!

On Windows machines, you need to change the nccl backend in it to gloo.
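A minimal sketch of that change, assuming the patch you applied initializes the process group with an explicit backend (the exact call site depends on the PR):

```python
import torch.distributed as dist

# NCCL only ships for Linux/CUDA; on Windows the gloo backend must be used instead.
dist.init_process_group(backend="gloo")  # instead of backend="nccl"
```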

yongjieding · Jun 14 '23 02:06

In ptuning/main.py, around line 60, add this line: `training_args.local_rank = -1`
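A hedged sketch of where that line would go; the surrounding line is paraphrased from the usual HfArgumentParser pattern, not copied verbatim from main.py:

```python
# ptuning/main.py, shortly after the training arguments are parsed (around line 60)
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
training_args.local_rank = -1  # force a single-process run so DDP is never used
```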

asmcos · Jul 01 '23 00:07

I also ran into this problem at first with transformers==4.30.2, but after switching to transformers==4.27.1 it went away.

Li-jiaxian · Oct 30 '23 09:10