pytorch-auto-drive

inplace operation error

Open solidexu opened this issue 2 years ago • 8 comments

When I tested RESA with ResNet-50, an error occurred. I then tested SCNN with ResNet-50 and hit the same issue:

python main_landet.py --train --config=./configs/lane_detection/resa/resnet50_culane.py --mixed-precision
Loaded torchvision ImageNet pre-trained weights V1.
Not using distributed mode
cuda
Traceback (most recent call last):
  File "main_landet.py", line 65, in <module>
    runner.run()
  File "/home/aaa/pytorch-auto-drive-master/utils/runners/lane_det_trainer.py", line 55, in run
    scaler.scale(loss).backward()
  File "/home/aaa/.local/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/aaa/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [5, 128, 36, 100]], which is output 0 of ReluBackward0, is at version 20; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

solidexu avatar Apr 26 '22 14:04 solidexu
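
The hint at the end of the traceback can be followed with a small stand-alone script before touching the training code. The sketch below is not taken from pytorch-auto-drive; `failing_backward` and the tensor shapes are made up for illustration. It reproduces the same class of error on CPU (an in-place update of a ReLU output) and runs it under `torch.autograd.set_detect_anomaly(True)`, which additionally reports the forward-pass traceback of the operation whose gradient could not be computed.

```python
import torch


def failing_backward():
    """Hypothetical minimal case: modify a ReLU output in place, then backprop."""
    x = torch.randn(4, 8, requires_grad=True)
    y = torch.relu(x)   # ReLU keeps its output for use in the backward pass
    y += 1.0            # in-place update: bumps y's version counter from 0 to 1
    y.sum().backward()  # autograd notices the saved output has changed


message = ""
# Anomaly mode slows execution, so it is meant for debugging only.
with torch.autograd.set_detect_anomaly(True):
    try:
        failing_backward()
    except RuntimeError as err:
        message = str(err)

print(message)
```

In the real project, the same context manager could presumably be wrapped around the training step that calls `backward()`; the extra traceback should then point at the forward line that created the tensor later modified in place.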

@solidexu I don't have a spare GPU right now. I will try to test it tomorrow.

voldemortX avatar Apr 26 '22 14:04 voldemortX

I don't know why, but the issue is solved by commenting out the relu in RESAReducer. It's too STRANGE for me. (screenshot attached)

solidexu avatar Apr 27 '22 07:04 solidexu
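
One possible explanation for the observation above, offered as a sketch rather than a diagnosis of the actual RESA code: ReLU saves its output for the backward pass, so any later in-place update of that same tensor invalidates it, and "version 20" in the error means the tensor was modified in place 20 times after ReLU produced it. Without the ReLU, the tensor being updated comes from a layer that does not need its own output for backward, so the check no longer fires. The function `accumulate` below is hypothetical; it only imitates repeated message passing with `torch.roll` and compares the in-place and out-of-place variants.

```python
import torch
import torch.nn.functional as F


def accumulate(feat, inplace, steps=3):
    """Hypothetical stand-in for iterative feature aggregation after a ReLU."""
    out = F.relu(feat)
    for _ in range(steps):
        shifted = torch.roll(out, shifts=1, dims=-1)  # new tensor, not a view
        if inplace:
            out.add_(0.5 * shifted)    # overwrites the tensor ReLU saved
        else:
            out = out + 0.5 * shifted  # allocates a fresh tensor each step
    return out


# Out-of-place variant: backward works.
feat = torch.randn(2, 4, 6, requires_grad=True)
accumulate(feat, inplace=False).sum().backward()

# In-place variant: the ReLU output ends up at version `steps`, and backward fails.
feat_bad = torch.randn(2, 4, 6, requires_grad=True)
bad = accumulate(feat_bad, inplace=True)
try:
    bad.sum().backward()
    failed = False
except RuntimeError:
    failed = True
print("in-place backward failed:", failed)
```

The out-of-place form costs extra memory per step, which is presumably why in-place updates were used in the first place; cloning the ReLU output once before the loop is another option under the same assumptions.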

Which PyTorch version are you using, and do you see this both with and without mixed precision?

voldemortX avatar Apr 27 '22 07:04 voldemortX

I use torch 1.10.2. I have tested both with and without mixed precision, and the issue is the same.

solidexu avatar Apr 27 '22 09:04 solidexu

@solidexu I don't have 1.10 available, but training starts normally for me with 1.6.0 (I have only one card, so I first changed world_size to 1 and then used a batch size of only 2).

Here is my command:

python main_landet.py --train --config=./configs/lane_detection/resa/resnet50_culane.py --mixed-precision --batch-size=2

voldemortX avatar Apr 27 '22 09:04 voldemortX

Are you running customized code or do you see that error in the current master branch?

voldemortX avatar Apr 27 '22 09:04 voldemortX

I see it on the current master branch; in fact, I downloaded your latest code three days ago. Commenting out the relu also leads to another error during training. I think I can try torch 1.6.0.

solidexu avatar Apr 27 '22 12:04 solidexu

Try adding a 1x1 conv at the top layer of RESA; it may help.

solidexu avatar Jun 15 '22 13:06 solidexu
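
A rough illustration of why the 1x1 conv suggestion could help, under the assumption that the failure comes from in-place updates applied directly to a ReLU output: a convolution's backward pass needs its input and weights but not its own output, so a conv placed after the ReLU yields a tensor that is safe to modify in place. The names `ReducerWithBuffer` and `inplace_message_passing` are invented for this sketch and do not correspond to classes in the repository.

```python
import torch
import torch.nn as nn


class ReducerWithBuffer(nn.Module):
    """Hypothetical channel reducer: conv + ReLU, followed by an extra 1x1 conv."""

    def __init__(self, in_channels=8, out_channels=4):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.ReLU(),
        )
        # The extra 1x1 conv: its output is a fresh tensor that no earlier
        # layer has saved for its own backward pass.
        self.buffer = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.buffer(self.reduce(x))


def inplace_message_passing(x, steps=3):
    """Stand-in for repeated in-place aggregation along the width axis."""
    for _ in range(steps):
        x.add_(0.5 * torch.roll(x, shifts=1, dims=-1))
    return x


model = ReducerWithBuffer()
inputs = torch.randn(2, 8, 5, 7, requires_grad=True)
out = inplace_message_passing(model(inputs))
out.sum().backward()  # succeeds: the ReLU output itself was never overwritten
print(inputs.grad.shape)
```

The extra layer adds parameters and changes the model, so checkpoints trained without it would not load unchanged; whether that trade-off is acceptable depends on the use case.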

@solidexu Sorry to bother you, but did you solve this issue by downgrading PyTorch? I think others have run into it as well.

voldemortX avatar Oct 13 '22 09:10 voldemortX

Closed by #121.

voldemortX avatar Apr 01 '23 03:04 voldemortX