weaver
Inplace operation error in gradient computation
I came across this error when trying to train. After some searching online, it seems that a tensor is being modified in place before the gradient computation has finished using it. I was hoping you could help me locate the error and let me know what I need to fix, as I am not too familiar with PyTorch. I have made no modifications to the utils/nn/tools.py script.
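For readers trying to locate the offending line for this kind of error, one option is PyTorch's anomaly detection mode, which makes the failing backward step also report the traceback of the forward operation that produced it. The following is a minimal sketch on a toy tensor, not the weaver training code; the names `w`, `h`, and `message` are hypothetical and used only for illustration.

```python
import torch

# Toy parameter; stands in for any tensor that requires gradients.
w = torch.tensor([0.0, 1.0], requires_grad=True)

message = None
# Anomaly mode records forward tracebacks so a failing backward step
# can point to the forward operation that created the problem tensor.
with torch.autograd.detect_anomaly():
    h = torch.exp(w)   # exp saves its output h for the backward pass
    h *= 2.0           # in-place update overwrites that saved output
    try:
        h.sum().backward()
    except RuntimeError as err:
        message = str(err)

print(message)
```

Anomaly mode slows training noticeably, so it is usually enabled only while debugging and removed afterwards.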
I have solved the issue by changing all the in-place operations here and here to out-of-place ones. Essentially, each var1 *= var2 was changed to var1 = var1*var2. Should this be changed in the code permanently to avoid this error in the future?
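The difference between the two forms can be illustrated with a small, self-contained sketch. It is not taken from weaver; the functions `loss_inplace` and `loss_out_of_place` and the tensors `w` and `mask` are hypothetical. It assumes the in-place update hits a tensor that autograd saved for the backward pass, which is the usual cause of this error.

```python
import torch

def loss_inplace(w, mask):
    h = torch.exp(w)   # exp keeps its output h to compute the gradient later
    h *= mask          # in-place: overwrites the tensor autograd saved
    return h.sum()

def loss_out_of_place(w, mask):
    h = torch.exp(w)
    h = h * mask       # new tensor; the saved output of exp is untouched
    return h.sum()

w = torch.tensor([0.0, 1.0, 2.0], requires_grad=True)
mask = torch.tensor([1.0, 0.0, 1.0])

# The in-place version is expected to fail when backward() runs.
try:
    loss_inplace(w, mask).backward()
    inplace_failed = False
except RuntimeError as err:
    inplace_failed = True
    print("in-place version failed:", err)

# The out-of-place version computes the same loss and backpropagates cleanly.
loss_out_of_place(w, mask).backward()
print(w.grad)
```

The out-of-place form allocates one extra tensor per operation, which is normally a small price for keeping the autograd graph valid.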
Hi @cgsavard -- can you share the PyTorch version? I can't seem to reproduce this error in, e.g., 1.12.1.
Yes. This occurred after I installed PyTorch with "conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge", because I was working on a different GPU (a K40) with a CUDA version newer than 11.6 (11.7). The PyTorch version was the stable 1.12.1, so I think it was actually the newer CUDA version that triggered the issue.
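When comparing environments as in this exchange, it can help to report the versions PyTorch itself sees rather than the ones requested at install time. A minimal sketch follows; the helper name `env_report` is hypothetical, and the GPU fields are only filled in when a CUDA device is visible.

```python
import torch

def env_report():
    """Collect the version details that matter when reproducing a bug."""
    report = {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report

report = env_report()
for key, value in report.items():
    print(f"{key}: {value}")
```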
I tested CUDA 11.6 + PyTorch 1.12.1 and still cannot reproduce this error. Did you change anything else when the problem got solved?