
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16000, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead

Open · unknowone opened this issue 2 years ago · 2 comments

Hi, thanks for sharing such a good project! I ran into a problem when I tried to train with:

bash tools/scripts/dist_train.sh 2 --cfg_file /public/chenrunze/xyy/VFF-main/tools/cfgs/kitti_models/VFF_PVRCNN.yaml

Here is the error:

Traceback (most recent call last):
  File "tools/train.py", line 205, in <module>
    main()
  File "tools/train.py", line 160, in main
    train_model(
  File "/public/chenrunze/xyy/VFF-main/tools/train_utils/train_utils.py", line 88, in train_model
    accumulated_iter = train_one_epoch(
  File "/public/chenrunze/xyy/VFF-main/tools/train_utils/train_utils.py", line 41, in train_one_epoch
    loss.backward()
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16000, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 100876 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 100877) of binary: /public/chenrunze/miniconda3/envs/bevfusion/bin/python3
Traceback (most recent call last):
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-12_16:28:26
  host      : 8265f0d3bcdf
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 100877)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Can you give some advice? Thanks a lot!
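(Editor's note, not part of the original report: a general PyTorch debugging step for this kind of error is to enable autograd anomaly detection before training. The backward error then also prints the forward-pass traceback of the operation that produced the tensor that was later modified in place, which helps locate the offending layer. A minimal sketch:

import torch

# Debug-only: anomaly detection slows training noticeably, so enable it just
# long enough to locate the offending in-place operation, then remove it again.
torch.autograd.set_detect_anomaly(True)

# ... build the model and dataloader as tools/train.py does, then ...
# loss.backward()  # the error now also shows where the saved tensor was created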

unknowone · Feb 12 '23 08:02

I have also encountered this problem. Have you solved it?

0neDawn · Mar 26 '23 02:03

I have also encountered this problem. Have you solved it?

I replaced ReLU with LeakyReLU to solve this problem, but this may affect the performance of the model. If you have a better solution, please tell me, thanks!
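(Editor's note: a minimal sketch of that kind of change; the actual layer definitions in VFF are not shown here, so the module below is illustrative. The usual root cause is an activation created with inplace=True overwriting a tensor that autograd saved for backward, so turning off the in-place update keeps the original ReLU, while LeakyReLU is the swap described above.)

import torch.nn as nn

# Illustrative only; the real block in the VFF code may be structured differently.
# Option 1: keep ReLU but drop the in-place update, so the saved activation
# is not overwritten before loss.backward().
act = nn.ReLU(inplace=False)

# Option 2: the workaround mentioned above, also non-inplace here. Note that
# LeakyReLU changes the activation itself and may affect model performance.
act = nn.LeakyReLU(negative_slope=0.01, inplace=False)

Option 1 is usually worth trying first, since it keeps the original activation function.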

liulin813 · Sep 13 '23 13:09