MedSegDiff

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Open · DWBSIC opened this issue Apr 13 '23 · 6 comments

```
Traceback (most recent call last):
  File "segmentation_train.py", line 117, in <module>
    main()
  File "segmentation_train.py", line 69, in main
    TrainLoop(
  File "E:\MedSegDiff-master2.0\guided_diffusion\train_util.py", line 83, in __init__
    self._load_and_sync_parameters()
  File "E:\MedSegDiff-master2.0\guided_diffusion\train_util.py", line 139, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "E:\MedSegDiff-master2.0\guided_diffusion\dist_util.py", line 78, in sync_params
    dist.broadcast(p, 0)
  File "D:\anaconda3\envs\py38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1438, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\py38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1561, in broadcast
    work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

Hello, training works fine for me on Linux, but when I try to train on Windows, with every parameter set exactly as the author specifies in the README, I get the error above. Why does this happen? Any help would be appreciated.

If you want to chat with me: WeChat: DWBSIC

DWBSIC · Apr 13 '23

I ran into this problem too. Since I am on Windows, the backend in dist.init_process_group(backend="gloo", init_method="env://") has to be changed first, otherwise you get an NCCL error (NCCL is not available on Windows). After that change I hit the same error as you, but it runs if you add p = p + 0 before dist.broadcast. I am training on my own data; my remaining problem is that GPU utilization is very low, only around 60-70%. Have you solved this completely?

170744039 · Apr 16 '23
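For reference, the backend change described above would look something like the sketch below; the os.name check is an illustrative addition for portability, not code from the repo.

```python
import os

import torch.distributed as dist

# NCCL only ships on Linux, so Windows runs must fall back to gloo.
backend = "gloo" if os.name == "nt" else "nccl"
dist.init_process_group(backend=backend, init_method="env://")
```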

I solved it following your method. Thank you very much! @170744039

xupinggl · May 07 '23

@170744039 Thank you for the solution! I have such limited knowledge of DDP that I had no clue how to fix it, even though I knew it was probably a Windows/Linux issue. Could you share more about how p = p + 0 solves "a leaf Variable that requires grad is being used in an in-place operation"?

blackcat1121 · May 22 '23
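For anyone else wondering, the mechanism can be seen in a few lines of plain PyTorch: a tensor created with requires_grad=True is a leaf, and autograd refuses in-place writes into leaves; the output of any operation, even p + 0, is a non-leaf tensor, which dist.broadcast is then allowed to modify in place.

```python
import torch

p = torch.zeros(3, requires_grad=True)
print(p.is_leaf)   # True: autograd forbids in-place writes into this tensor
# p.add_(1)        # would raise: "a leaf Variable that requires grad is being
#                  #  used in an in-place operation."

q = p + 0          # the output of an op is a non-leaf tensor with its own storage
print(q.is_leaf)   # False: in-place writes (such as broadcast's copy) are allowed
```

The catch is that p + 0 is a copy, so the broadcast fills the copy rather than the registered parameter; on a single-process Windows run this is harmless, but it would presumably leave parameters unsynced in a real multi-rank run.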


@170744039 Hi! Thank you for your awesome solution for Windows. Could you open a pull request with your modification? I do not have a Windows PC, so it would be great if someone could help with Windows support.

WuJunde · May 25 '23

dist_util.py

Old:

```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```

New:

```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        p = p + 0
        with th.no_grad():
            dist.broadcast(p, 0)
```

Longchentong · Jan 28 '24
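A closing note on that workaround: since p = p + 0 rebinds p to a fresh copy, the broadcast writes into the copy instead of the registered parameter, which is harmless single-process but defeats cross-rank syncing. An untested variant (a suggestion, not from this thread) that keeps the original tensor is to broadcast p.data, which shares p's storage but is detached from autograd:

```python
import torch as th
import torch.distributed as dist


def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            # p.data shares p's storage but does not require grad, so the
            # in-place broadcast passes the leaf check and the synced values
            # land in the actual parameter.
            dist.broadcast(p.data, 0)
```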