
one of the variables needed for gradient computation has been modified by an inplace operation

with-twilight opened this issue 2 years ago • 7 comments

Hello, I am very interested in your work. Now I have the following problem:

Traceback (most recent call last):
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 375, in <module>
    main()
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 238, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(opt, cfg))
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 323, in main_worker
    loss, acc17 = train(opt, train_loader, m, criterion, optimizer, writer)
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 79, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [768]], which is output 0 of IndexPutBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
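
As the hint at the end of the error suggests, enabling anomaly detection is the quickest way to locate the offending line. Below is a minimal toy sketch (illustrative code, not HybrIK's model) of both the debugging switch and the kind of in-place index assignment that triggers this error:

import torch

# Debugging only: makes the backward error point at the forward operation that
# produced the tensor later modified in place. It slows training noticeably.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(768, requires_grad=True)
y = x.exp()        # exp() saves its output y for the backward pass
y[0] = 0.0         # in-place index assignment bumps y's version counter
loss = y.sum()
loss.backward()    # raises the "modified by an inplace operation" RuntimeError;
                   # with anomaly detection on, a warning first shows where y was created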

Looking forward to your reply!

with-twilight avatar Sep 21 '21 13:09 with-twilight

Hi, could you provide more details, such as which config file you are running?

Jeff-sjtu avatar Oct 08 '21 14:10 Jeff-sjtu

System: Ubuntu 16.04, CUDA 10.2, PyTorch 1.2, torchvision 0.4. This problem occurs while running:

./scripts/train_smpl.sh train_res34 ./configs/256x192_adam_lr1e-3-res34_smpl_3d_base_2x_mix.yaml

I don't know how to solve this problem. The error did not appear when I re-ran the code a few days later. The program was then interrupted for personal reasons, and after another rerun the code reported the above error again.

with-twilight avatar Oct 15 '21 03:10 with-twilight

Maybe the problem is the PyTorch version I installed. I didn't notice that I was running in a different environment yesterday: pytorch 1.9, torchvision 0.3.
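
If it helps, here is a quick generic check (nothing HybrIK-specific) of which versions the currently active environment really provides:

import torch
import torchvision

# Print the versions picked up by the environment that actually runs the training
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())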

with-twilight avatar Oct 15 '21 07:10 with-twilight

Hi @with-twilight, it seems the problem is caused by the PyTorch environment. Our code is tested on PyTorch 1.2, so there may be some bugs with PyTorch 1.9.

Jeff-sjtu avatar Oct 17 '21 06:10 Jeff-sjtu

> Hi @with-twilight, it seems the problem is caused by the PyTorch environment. Our code is tested on PyTorch 1.2, so there may be some bugs with PyTorch 1.9.

Hi, thanks for the great work! I also encountered exactly the same problem. My environment: pytorch 1.2.0, torchvision 0.4.0, cuda 10.2, python 3.6.

The train_smpl.sh is:

EXPID=$1
CONFIG=$2

python ./scripts/train_smpl.py \
    --nThreads 10 \
    --launcher pytorch --rank 0 \
    --dist-url tcp://localhost:23456 \
    --exp-id ${EXPID} \
    --cfg ${CONFIG} --seed 123123

I am trying to train with 2 GPUs, so I ran this in the terminal:

CUDA_VISIBLE_DEVICES=2,7 ./scripts/train_smpl.sh train_res34 ./configs/test_config.yaml

The test_config.yaml only changes the dataset paths and sets WORLD_SIZE to 2.

Looking forward to your reply!

lulindeng avatar Nov 13 '21 08:11 lulindeng

@lulindeng I ran into similar problems before. After I switched to pytorch==1.6.0, the problem disappeared.

biansy000 avatar Nov 22 '21 14:11 biansy000

> @lulindeng I ran into similar problems before. After I switched to pytorch==1.6.0, the problem disappeared.

Thank you! I solved the problem by using the revised code from this issue: https://github.com/Jeff-sjtu/HybrIK/issues/35#issuecomment-887304816
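
For readers landing here later: the usual pattern behind fixes for this class of error is to stop modifying, in place, a tensor that autograd has saved for the backward pass. A minimal toy sketch of that pattern (illustrative only, not the actual patch referenced in issue #35):

import torch

x = torch.randn(768, requires_grad=True)
y = x.exp()              # autograd saves y for exp()'s backward

# Problematic: y[:10] = 0.0 would bump y's version counter and break backward.

# Out-of-place alternative: modify a clone so the saved tensor stays intact.
y_fixed = y.clone()
y_fixed[:10] = 0.0

loss = y_fixed.sum()
loss.backward()          # succeeds; gradients flow back through the clone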

lulindeng avatar Nov 22 '21 14:11 lulindeng