
Question about gradients during training

Open genzhengmiaohong opened this issue 1 year ago • 4 comments

Hello, while modifying train.py to train the network, I hit the following error when the loss computes its gradients at the end: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation. Do you know how to resolve this? My CUDA version is 12.2, so the versions in requirement.txt are not suitable for me; I first used torch 2.1.0 and then switched to 2.2.1+cu118, and the error occurs with both. Looking forward to your reply.

genzhengmiaohong avatar Feb 27 '24 08:02 genzhengmiaohong

Have you solved it? I ran into the same problem.

tangyz213 avatar Feb 29 '24 11:02 tangyz213

Can you provide more detailed error information, please? I need to pinpoint the location of the error.

ByChelsea avatar Mar 01 '24 11:03 ByChelsea

> Can you provide more detailed error information, please? I need to pinpoint the location of the error.

Traceback (most recent call last):
  File "train.py", line 177, in <module>
    train(args)
  File "train.py", line 140, in train
    loss.backward()
  File "C:\Users\yzc\.conda\envs\APRIL_GAN\lib\site-packages\torch\_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "C:\Users\yzc\.conda\envs\APRIL_GAN\lib\site-packages\torch\autograd\__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [8, 1369, 768]], which is output 0 of DivBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
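
The hint at the end of the error can be followed literally: enabling anomaly detection (either torch.autograd.set_detect_anomaly(True) at the top of train.py, or the context manager below) makes the RuntimeError include the forward traceback of the op whose saved tensor was clobbered. A minimal, self-contained sketch with toy tensors, not the actual train.py code:

    import torch

    # Anomaly mode must cover the forward pass too, so the op that created the
    # saved tensor can be reported when backward later fails.
    with torch.autograd.detect_anomaly():
        x = torch.randn(4, 8, requires_grad=True)
        y = x.exp()          # exp() saves its output for the backward pass
        y /= 2               # in-place division bumps that tensor's version counter
        y.sum().backward()   # RuntimeError, now with the forward traceback of exp()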


My env: Windows 11, torch 2.2.2+cu121. In my environment, I modified line 122 in train.py to the following, and the error disappeared:

patch_tokens[layer] = patch_tokens[layer] / patch_tokens[layer].norm(dim=-1, keepdim=True)
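
The error message suggests the original line used the in-place `/=`, which modifies a tensor that autograd saved during the forward pass; the out-of-place division above creates a new tensor and leaves the saved one untouched. A minimal sketch with toy tensors (not the real patch tokens or text features):

    import torch

    # Toy stand-in for a patch-token list; exp() saves its output for backward,
    # much like ops inside the backbone save intermediate activations.
    patch_tokens = [torch.randn(2, 5, 8, requires_grad=True).exp()]

    # In-place normalisation would clobber that saved tensor and break backward:
    # patch_tokens[0] /= patch_tokens[0].norm(dim=-1, keepdim=True)   # RuntimeError at .backward()

    # Out-of-place normalisation builds a new tensor instead, so backward succeeds:
    patch_tokens[0] = patch_tokens[0] / patch_tokens[0].norm(dim=-1, keepdim=True)
    patch_tokens[0].sum().backward()
    print("backward OK")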

yangzc0214 avatar Apr 07 '24 08:04 yangzc0214

fix it here

oylz avatar Apr 15 '24 06:04 oylz