When running multi-node distributed training with DeepSpeed and loading the opt-1.3b model, I get the error "a leaf Variable that requires grad is being used in an in-place operation":
```
colorful: Traceback (most recent call last):
colorful:   File "DeepSpeed-Chat/training/main_sup.py", line 339, in <module>
colorful:     main()
colorful:   File "DeepSpeed-Chat/training/main_sup.py", line 286, in main
colorful:     model, optimizer, _, lr_scheduler = deepspeed.initialize(
colorful:   File "/home/vocust001/miniconda3/envs/ldm/lib/python3.9/site-packages/deepspeed/__init__.py", line 156, in initialize
colorful:     engine = DeepSpeedEngine(args=args,
colorful:   File "/home/vocust001/miniconda3/envs/ldm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 286, in __init__
colorful:     self._configure_distributed_model(model)
colorful:   File "/home/vocust001/miniconda3/envs/ldm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1087, in _configure_distributed_model
colorful:     self._broadcast_model()
colorful:   File "/home/vocust001/miniconda3/envs/ldm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1017, in _broadcast_model
colorful:     dist.broadcast(p, groups._get_broadcast_src_rank(), group=self.data_parallel_group)
colorful:   File "/home/vocust001/miniconda3/envs/ldm/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
colorful:     return func(*args, **kwargs)
colorful:   File "/home/vocust001/miniconda3/envs/ldm/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
colorful:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
colorful:   File "/home/vocust001/miniconda3/envs/ldm/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 81, in broadcast
colorful:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
colorful:   File "/home/vocust001/miniconda3/envs/ldm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1201, in broadcast
colorful:     work.wait()
colorful: RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
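
For context, the call that triggers the failure (main_sup.py line 286 in the traceback) looks roughly like the sketch below. This is a minimal reconstruction, not the actual script: the ds_config values and the model-loading details are assumptions for illustration; only the `deepspeed.initialize` return signature is taken from the traceback.

```python
# Minimal sketch of the failing setup (assumed config values; the real
# main_sup.py may differ). The error is raised inside deepspeed.initialize,
# when _broadcast_model() broadcasts the freshly loaded parameters from
# rank 0 to the other data-parallel ranks.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

# Load opt-1.3b; its parameters are leaf tensors with requires_grad=True.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# Corresponds to main_sup.py line 286 in the traceback.
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```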