HybrIK
one of the variables needed for gradient computation has been modified by an inplace operation
Hello, I am very interested in your work. Now I have the following problems:
Traceback (most recent call last):
File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 375, in
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 323, in main_worker
    loss, acc17 = train(opt, train_loader, m, criterion, optimizer, writer)
  File "/home/ubuntu/data/nsga/HybrIK/scripts/train_smpl.py", line 79, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/lrwf/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [768]], which is output 0 of IndexPutBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
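For anyone hitting the same RuntimeError: this class of failure can be reproduced outside HybrIK in a few lines (a minimal sketch, not the project's code). The pattern is that an in-place index assignment modifies a tensor autograd has saved for the backward pass, bumping its version counter; rebinding to a clone before writing is one out-of-place workaround:

```python
import torch

# Minimal sketch (not HybrIK code): an in-place index assignment
# modifies a tensor that autograd saved for the backward pass.
w = torch.randn(4, requires_grad=True)
y = w * 2
out = (y * y).sum()   # the multiply saves y for its backward
y[0] = 0.0            # index-put bumps y's version counter in place

raised = False
try:
    out.backward()    # "... modified by an inplace operation"
except RuntimeError as e:
    raised = "inplace" in str(e)

# Out-of-place workaround: rebind to a clone before writing,
# so the tensor saved for backward is left untouched.
w2 = torch.randn(4, requires_grad=True)
y2 = w2 * 2
out2 = (y2 * y2).sum()
y2 = y2.clone()       # later writes now hit the copy, not the saved tensor
y2[0] = 0.0
out2.backward()       # succeeds; w2.grad == 8 * w2
```

The real fix in the project may differ (see the linked issue below in the thread); this only illustrates why the version check fires.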
Looking forward to your reply!
Hi, could you provide more details? Such as which config file you run.
System: Ubuntu 16.04, CUDA: 10.2, PyTorch: 1.2, torchvision: 0.4. This problem occurs while running ./scripts/train_smpl.sh train_res34 ./configs/256x192_adam_lr1e-3-res34_smpl_3d_base_2x_mix.yaml
I don't know how to solve this problem. The program had been interrupted for personal reasons, and the code reported the above error after a rerun. However, when I re-ran it again a few days later, the error did not appear.
Maybe the problem is the PyTorch version I installed. I didn't notice that I was running in another environment yesterday: PyTorch 1.9, torchvision 0.3.
Hi @with-twilight, it seems the problem is caused by the PyTorch environment. Our code is tested on PyTorch 1.2, so there may be some bugs with PyTorch 1.0.
Hi, thanks for the great work! I also encountered exactly the same problem. PyTorch: 1.2.0, torchvision: 0.4.0, CUDA: 10.2, Python: 3.6
The train_smpl.sh is
EXPID=$1
CONFIG=$2
python ./scripts/train_smpl.py \
--nThreads 10 \
--launcher pytorch --rank 0 \
--dist-url tcp://localhost:23456 \
--exp-id ${EXPID} \
--cfg ${CONFIG} --seed 123123
And I tried to train with 2 GPUs by running this in the terminal:
CUDA_VISIBLE_DEVICES=2,7 ./scripts/train_smpl.sh train_res34 ./configs/test_config.yaml
The test_config.yaml only changes the dataset path and sets WORLD_SIZE to 2.
Looking forward to your reply!
@lulindeng I met with similar problems before. After I switched to pytorch==1.6.0, the problem disappeared.
Thank you! I solved the problem by using the revised code in this issue: https://github.com/Jeff-sjtu/HybrIK/issues/35#issuecomment-887304816