
One of the variables needed for gradient computation has been modified by an inplace operation:

Open moonbucks opened this issue 2 years ago • 2 comments

Hi, I'm having problems running the code. I installed all the prerequisites successfully, but when I try to run the model on a single GPU it returns an error and I don't know why.

Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main()
  File "train.py", line 173, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/mnt/ssd/3ddet/SASA/tools/train_utils/train_utils.py", line 94, in train_model
    dataloader_iter=dataloader_iter
  File "/mnt/ssd/3ddet/SASA/tools/train_utils/train_utils.py", line 41, in train_one_epoch
    loss.backward()
  File "/home/user/.local/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 1024, 256, 64]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).


I also tried running with torch.autograd.set_detect_anomaly(True) set, but it still returns a similar error. How can I solve this problem?
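
For reference, here is a minimal standalone snippet (not from the SASA code, just an illustrative sketch) that reproduces the same class of error and shows where anomaly detection has to be enabled for the extra forward traceback to appear:

    import torch

    # Must be enabled before the forward pass, or the hint in the error
    # message has no forward traceback to report (it also slows training,
    # so it is for debugging only).
    torch.autograd.set_detect_anomaly(True)

    x = torch.randn(4, 8, requires_grad=True)
    y = torch.relu(x)   # autograd saves the ReLU output for backward
    y *= 2              # in-place edit bumps the saved tensor's version
    y.sum().backward()  # RuntimeError: ... output 0 of ReluBackward0 ...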

Thanks

moonbucks avatar Mar 01 '22 00:03 moonbucks

I have the same problem with PyTorch 1.10. It seems to be a version issue: newer versions of PyTorch are stricter about in-place modification of tensors that autograd needs for the backward pass. I haven't found a solution yet.
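
For what it's worth, the "is at version 1; expected version 0" part of the message comes from a per-tensor version counter that autograd checks when it uses a saved tensor. A small sketch of that mechanism (._version is an internal attribute, so treat this as illustrative only):

    import torch

    t = torch.ones(3, requires_grad=True)
    saved = torch.relu(t)   # autograd saves this output for backward
    print(saved._version)   # 0 -- the version autograd recorded
    saved.add_(1)           # any in-place op increments the counter
    print(saved._version)   # 1 -- backward would now raise the error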

JianKF avatar Apr 04 '22 11:04 JianKF

Hi, I have solved this problem.

Newer versions of PyTorch reject in-place operations on tensors that are still needed for gradient computation, which means the ReLU activations should be set to inplace=False, and in-place operators like '+=' and '-=' should not be applied in the forward pass. So I changed the original code at line 206 of SASA/pcdet/ops/pointnet2/pointnet2_batch/pointnet2_modules.py from

    new_features *= idx_cnt_mask

to

    new_features_clone = new_features.clone()
    new_features = new_features_clone * idx_cnt_mask

and everything works.
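
For anyone hitting this later, a self-contained sketch of the same fix pattern (the tensor shapes are made up; only the variable names follow the issue):

    import torch

    new_features = torch.relu(torch.randn(2, 1024, 16, requires_grad=True))
    idx_cnt_mask = (torch.rand(2, 1024, 16) > 0.5).float()

    # Before (fails on newer PyTorch: the ReLU output is saved for
    # backward and the in-place multiply changes its version):
    #   new_features *= idx_cnt_mask

    # After (out-of-place multiply; the clone() mirrors the patch above,
    # though rebinding with new_features * idx_cnt_mask alone also works):
    new_features = new_features.clone() * idx_cnt_mask

    new_features.sum().backward()  # backward now succeeds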

Regards.

JianKF avatar Apr 05 '22 09:04 JianKF