
Default process group has not been initialized, please make sure to call init_process_group

Open · ustcxiexk opened this issue 2 years ago · 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Has anyone run into this problem? I hit it while running the official code without any modifications, and the solutions I found online don't work.

File "/home/work/LLM/chatglm/ptuning/main.py", line 431, in main() File "/home/work/LLM/chatglm/ptuning/main.py", line 370, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/home/work/LLM/chatglm/ptuning/trainer.py", line 1635, in train return inner_training_loop( File "/home/work/LLM/chatglm/ptuning/trainer.py", line 1722, in _inner_training_loop model = self._wrap_model(self.model_wrapped) File "/home/work/LLM/chatglm/ptuning/trainer.py", line 1547, in _wrap_model model = nn.parallel.DistributedDataParallel( File "/home/work/miniconda3/envs/chatglm/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 625, in init self.process_group = _get_default_group() File "/home/work/miniconda3/envs/chatglm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 697, in _get_default_group raise RuntimeError( RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Expected Behavior

No response

Steps To Reproduce

Download the latest ChatGLM code from the official repo, then run `cd ptuning` and `sh train.sh`.

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

ustcxiexk · Jun 02 '23 02:06

Launch it in distributed mode with torchrun.
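For example (hypothetical flags, adjust to your setup): if train.sh currently runs `python3 main.py ...`, launching the same script as `torchrun --nproc_per_node=<num_gpus> main.py ...` lets torchrun spawn the workers and set the environment variables that process-group initialization relies on.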

Youggls · Jun 02 '23 03:06

> Launch it in distributed mode with torchrun.

Thanks for the reply! Could you explain the steps in a bit more detail? Do I first need to modify main.py for distributed training, then run sh train.sh, and then launch it with torchrun? Thanks!!

ustcxiexk · Jun 02 '23 03:06

> Launch it in distributed mode with torchrun.

> Thanks for the reply! Could you explain the steps in a bit more detail? Do I first need to modify main.py for distributed training, then run sh train.sh, and then launch it with torchrun? Thanks!!

https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/trainer.py#L1532 Looking at the official default code, DDP is only started when local_rank != -1 in the training arguments. Some setting on your side is probably affecting the training arguments. If you don't need multi-GPU training, try manually passing the command-line argument `--local_rank -1`.
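A simplified sketch of the branch being referenced (paraphrased, not a verbatim copy of ptuning/trainer.py):

```python
import torch.nn as nn


def wrap_model(model, local_rank):
    # DDP is only engaged when a distributed run was requested (local_rank != -1).
    # Constructing it requires the default process group to already be initialized,
    # which is what launching via torchrun takes care of.
    if local_rank != -1:
        model = nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank], output_device=local_rank
        )
    return model
```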

Youggls · Jun 02 '23 06:06

@ustcxiexk Try this: https://github.com/THUDM/ChatGLM-6B/pull/1173/files

Barbery · Jun 02 '23 06:06

> @ustcxiexk Try this: https://github.com/THUDM/ChatGLM-6B/pull/1173/files

Thanks! It works now, awesome!

ustcxiexk · Jun 03 '23 01:06

> Launch it in distributed mode with torchrun.

> Thanks for the reply! Could you explain the steps in a bit more detail? Do I first need to modify main.py for distributed training, then run sh train.sh, and then launch it with torchrun? Thanks!!

> https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/trainer.py#L1532 Looking at the official default code, DDP is only started when local_rank != -1 in the training arguments. Some setting on your side is probably affecting the training arguments. If you don't need multi-GPU training, try manually passing the command-line argument `--local_rank -1`.

Thanks for the reply. I got it working with the approach the commenter below provided. Training is very slow, though; it has only reached the halfway point since yesterday afternoon. I'll try this method once it finishes.

ustcxiexk · Jun 03 '23 01:06

> @ustcxiexk Try this: https://github.com/THUDM/ChatGLM-6B/pull/1173/files

> Thanks! It works now, awesome!

On Windows machines, you need to change the nccl backend in it to gloo.
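A minimal sketch of that change, assuming the patch you applied initializes the process group with an explicit backend (the exact call site depends on the PR):

```python
import torch.distributed as dist

# NCCL only ships for Linux/CUDA; on Windows the gloo backend must be used instead.
dist.init_process_group(backend="gloo")  # instead of backend="nccl"
```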

yongjieding · Jun 14 '23 02:06

In ptuning/main.py, around line 60, add this line: `training_args.local_rank = -1`
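A hedged sketch of where that line would go; the surrounding line is paraphrased from the usual HfArgumentParser pattern, not copied verbatim from main.py:

```python
# ptuning/main.py, shortly after the training arguments are parsed (around line 60)
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
training_args.local_rank = -1  # force a single-process run so DDP is never used
```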

asmcos · Jul 01 '23 00:07

I also ran into this problem at first with transformers==4.30.2, but after switching to transformers==4.27.1 it went away.

Li-jiaxian · Oct 30 '23 09:10