
one of the variables needed for gradient computation has been modified by an inplace operation

with-twilight opened this issue 2 years ago • 7 comments

Hello, I am very interested in your work. Now I have the following problem:

Traceback (most recent call last):
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 375, in <module>
    main()
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 238, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(opt, cfg))
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 323, in main_worker
    loss, acc17 = train(opt, train_loader, m, criterion, optimizer, writer)
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 79, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [768]], which is output 0 of IndexPutBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
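
As the hint at the end of the error suggests, enabling anomaly detection is the quickest way to locate the offending line. Below is a minimal toy sketch (illustrative code, not HybrIK's model) of both the debugging switch and the kind of in-place index assignment that triggers this error:

import torch

# Debugging only: makes the backward error point at the forward operation that
# produced the tensor later modified in place. It slows training noticeably.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(768, requires_grad=True)
y = x.exp()        # exp() saves its output y for the backward pass
y[0] = 0.0         # in-place index assignment bumps y's version counter
loss = y.sum()
loss.backward()    # raises the "modified by an inplace operation" RuntimeError;
                   # with anomaly detection on, a warning first shows where y was created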

Looking forward to your reply!

with-twilight avatar Sep 21 '21 13:09 with-twilight

Hi, could you provide more details, such as which config file you are running?

Jeff-sjtu avatar Oct 08 '21 14:10 Jeff-sjtu

System: Ubuntu 16.04, CUDA 10.2, PyTorch 1.2, torchvision 0.4. This problem occurs while running:

./scripts/train_smpl.sh train_res34 ./configs/256x192_adam_lr1e-3-res34_smpl_3d_base_2x_mix.yaml

I don't know how to solve this problem. The error did not appear when I re-ran the code a few days later. The program was then interrupted for personal reasons, and after another rerun the code reported the above error again.

with-twilight avatar Oct 15 '21 03:10 with-twilight

Maybe the problem is the PyTorch version I installed. I didn't notice that I was running in a different environment yesterday: pytorch 1.9, torchvision 0.3.
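
If it helps, here is a quick generic check (nothing HybrIK-specific) of which versions the currently active environment really provides:

import torch
import torchvision

# Print the versions picked up by the environment that actually runs the training
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())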

with-twilight avatar Oct 15 '21 07:10 with-twilight

Hi @with-twilight, it seems the problem is caused by the PyTorch environment. Our code is tested on PyTorch 1.2, so there may be some bugs with PyTorch 1.9.

Jeff-sjtu avatar Oct 17 '21 06:10 Jeff-sjtu

> Hi @with-twilight, it seems the problem is caused by the PyTorch environment. Our code is tested on PyTorch 1.2, so there may be some bugs with PyTorch 1.9.

Hi, thanks for the great work! I also encountered exactly the same problem. My environment: pytorch 1.2.0, torchvision 0.4.0, cuda 10.2, python 3.6.

The train_smpl.sh is:

EXPID=$1
CONFIG=$2

python ./scripts/train_smpl.py \
    --nThreads 10 \
    --launcher pytorch --rank 0 \
    --dist-url tcp://localhost:23456 \
    --exp-id ${EXPID} \
    --cfg ${CONFIG} --seed 123123

I am trying to train with 2 GPUs, so I ran this in the terminal:

CUDA_VISIBLE_DEVICES=2,7 ./scripts/train_smpl.sh train_res34 ./configs/test_config.yaml

The test_config.yaml only changes the dataset paths and sets WORLD_SIZE to 2.

Looking forward to your reply!

lulindeng avatar Nov 13 '21 08:11 lulindeng

@lulindeng I ran into similar problems before. After I switched to pytorch==1.6.0, the problem disappeared.

biansy000 avatar Nov 22 '21 14:11 biansy000

> @lulindeng I ran into similar problems before. After I switched to pytorch==1.6.0, the problem disappeared.

Thank you! I solved the problem by using the revised code from this issue: https://github.com/Jeff-sjtu/HybrIK/issues/35#issuecomment-887304816
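
For readers landing here later: the usual pattern behind fixes for this class of error is to stop modifying, in place, a tensor that autograd has saved for the backward pass. A minimal toy sketch of that pattern (illustrative only, not the actual patch referenced in issue #35):

import torch

x = torch.randn(768, requires_grad=True)
y = x.exp()              # autograd saves y for exp()'s backward

# Problematic: y[:10] = 0.0 would bump y's version counter and break backward.

# Out-of-place alternative: modify a clone so the saved tensor stays intact.
y_fixed = y.clone()
y_fixed[:10] = 0.0

loss = y_fixed.sum()
loss.backward()          # succeeds; gradients flow back through the clone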

lulindeng avatar Nov 22 '21 14:11 lulindeng