MedSegDiff

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Open · DWBSIC opened this issue Apr 13 '23 · 6 comments

```
Traceback (most recent call last):
  File "segmentation_train.py", line 117, in <module>
    main()
  File "segmentation_train.py", line 69, in main
    TrainLoop(
  File "E:\MedSegDiff-master2.0\guided_diffusion\train_util.py", line 83, in __init__
    self._load_and_sync_parameters()
  File "E:\MedSegDiff-master2.0\guided_diffusion\train_util.py", line 139, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "E:\MedSegDiff-master2.0\guided_diffusion\dist_util.py", line 78, in sync_params
    dist.broadcast(p, 0)
  File "D:\anaconda3\envs\py38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1438, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\py38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1561, in broadcast
    work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

Hello, training works fine for me on Linux, but when I try to train on Windows, with every parameter set exactly as the author specifies in the README, I get the error above. Why does this happen? Any help would be appreciated.

If you want to chat with me: WeChat: DWBSIC

DWBSIC · Apr 13 '23

I ran into this problem too. Since I am on Windows, the backend in dist.init_process_group(backend="gloo", init_method="env://") has to be changed first, otherwise you get an NCCL error (NCCL is not available on Windows). After that change I hit the same error as you, but it runs if you add p = p + 0 before dist.broadcast. I am training on my own data; my remaining problem is that GPU utilization is very low, only around 60-70%. Have you solved this completely?

170744039 · Apr 16 '23
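For reference, the backend change described above would look something like the sketch below; the os.name check is an illustrative addition for portability, not code from the repo.

```python
import os

import torch.distributed as dist

# NCCL only ships on Linux, so Windows runs must fall back to gloo.
backend = "gloo" if os.name == "nt" else "nccl"
dist.init_process_group(backend=backend, init_method="env://")
```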

I solved it following your method. Thank you very much! @170744039

xupinggl · May 07 '23

@170744039 Thank you for the solution! I have such limited knowledge of DDP that I had no clue how to fix it, even though I knew it was probably a Windows/Linux issue. Could you share more about how p = p + 0 solves "a leaf Variable that requires grad is being used in an in-place operation"?

blackcat1121 · May 22 '23
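For anyone else wondering, the mechanism can be seen in a few lines of plain PyTorch: a tensor created with requires_grad=True is a leaf, and autograd refuses in-place writes into leaves; the output of any operation, even p + 0, is a non-leaf tensor, which dist.broadcast is then allowed to modify in place.

```python
import torch

p = torch.zeros(3, requires_grad=True)
print(p.is_leaf)   # True: autograd forbids in-place writes into this tensor
# p.add_(1)        # would raise: "a leaf Variable that requires grad is being
#                  #  used in an in-place operation."

q = p + 0          # the output of an op is a non-leaf tensor with its own storage
print(q.is_leaf)   # False: in-place writes (such as broadcast's copy) are allowed
```

The catch is that p + 0 is a copy, so the broadcast fills the copy rather than the registered parameter; on a single-process Windows run this is harmless, but it would presumably leave parameters unsynced in a real multi-rank run.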


@170744039 Hi! Thank you for your awesome solution for Windows. Could you open a pull request with your modification? I do not have a Windows PC, so it would be great if someone could help with Windows support.

WuJunde · May 25 '23

dist_util.py

Old:

```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```

New:

```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        p = p + 0
        with th.no_grad():
            dist.broadcast(p, 0)
```

Longchentong · Jan 28 '24
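A closing note on that workaround: since p = p + 0 rebinds p to a fresh copy, the broadcast writes into the copy instead of the registered parameter, which is harmless single-process but defeats cross-rank syncing. An untested variant (a suggestion, not from this thread) that keeps the original tensor is to broadcast p.data, which shares p's storage but is detached from autograd:

```python
import torch as th
import torch.distributed as dist


def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            # p.data shares p's storage but does not require grad, so the
            # in-place broadcast passes the leaf check and the synced values
            # land in the actual parameter.
            dist.broadcast(p.data, 0)
```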