MedSegDiff
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
Traceback (most recent call last):
  File "segmentation_train.py", line 117, in
```
Hello, training works fine for me on Linux, but when I try to train on Windows I get the error above, even though all parameters are set exactly as in the author's README. Why does this error occur? Any help would be appreciated.
If you want to chat with me: WeChat: DWBSIC
I ran into this problem too. Since I am on Windows, the backend in the initialization has to be changed first, to dist.init_process_group(backend="gloo", init_method="env://"); otherwise it throws an NCCL error. After that I hit the same problem as you, but with one more change it runs: add p = p + 0 before the dist.broadcast call. I am training on my own data, though, and my new problem is that GPU utilization is very low, only around 60-70%. Have you solved that part completely?
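For reference, a minimal sketch of the backend change described above. The os.name guard is my own addition, not from the repo, and init_method="env://" assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already set in the environment, as the setup code in dist_util.py presumably arranges:

```python
import os
import torch.distributed as dist

# NCCL only ships on Linux, so fall back to gloo on Windows;
# keeping "nccl" here makes init_process_group fail with an NCCL error.
backend = "gloo" if os.name == "nt" else "nccl"

dist.init_process_group(backend=backend, init_method="env://")
```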
Solved it following your method. Thank you very much! @170744039
@170744039 Thank you for the solution! My knowledge of DDP is so limited that I had no clue how to fix this, even though I knew it was probably a Windows/Linux difference. Could you explain how p = p + 0 resolves "a leaf Variable that requires grad is being used in an in-place operation"?
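For anyone wondering the same thing: the error enforces PyTorch's rule that a leaf tensor with requires_grad=True may not be modified in place by an autograd-visible operation. p = p + 0 rebinds p to the output of an addition, which is a non-leaf tensor, and in-place writes into a non-leaf tensor are permitted. Why the gloo broadcast path trips this check even inside th.no_grad() is less clear to me; the sketch below (my own standalone illustration, not code from the repo) only demonstrates the general rule:

```python
import torch

p = torch.zeros(3, requires_grad=True)  # a leaf tensor that requires grad

try:
    p.add_(1)  # in-place op on a grad-requiring leaf tensor
except RuntimeError as e:
    # "a leaf Variable that requires grad is being used in an in-place operation."
    print(e)

q = p + 0   # q is produced by an op, hence a non-leaf tensor
q.add_(1)   # in-place ops on non-leaf tensors are allowed
print(q)
```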
@170744039 Hi! Thank you for your awesome Windows solution. Could you open a pull request with your modification? I do not have a Windows PC, so it would be great if someone could help improve Windows compatibility.
dist_util.py
old:

```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```
new:

```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        # Rebind p to the output of an op so it is no longer a leaf tensor;
        # the in-place write done by the broadcast then no longer triggers
        # the leaf-variable check.
        p = p + 0
        with th.no_grad():
            dist.broadcast(p, 0)
```
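One caveat on this workaround: p = p + 0 rebinds the loop variable, so the broadcast writes into a fresh copy rather than into the original parameter. In a single-process Windows run (world size 1) there is nothing to synchronize, so this is harmless, but on a real multi-GPU job the parameters on non-zero ranks would silently stop being synced from rank 0. A commonly suggested alternative, which I have not tested here, is dist.broadcast(p.data, 0): p.data does not require grad but shares the parameter's storage, so the broadcast writes through to the parameter without touching the autograd leaf check.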