
Inplace operation error in gradient computation

Open cgsavard opened this issue 2 years ago • 4 comments

[Screenshot: Screen Shot 2022-09-28 at 5 53 57 PM]

I came across this error when trying to train. After a bit of searching, it seems that a tensor is being modified in place before the gradient computation has finished using it. I was hoping you could help me locate the error and tell me what I need to fix, as I am not very familiar with PyTorch. I have made no modifications to the utils/nn/tools.py script.
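For locating which operation triggers this kind of error, PyTorch's autograd anomaly detection can help. Below is a minimal sketch on a toy example (a sigmoid followed by an in-place multiply, which is an assumption for illustration and not code from this repository). With anomaly detection enabled, the backward error is accompanied by a traceback of the forward operation whose saved tensor was overwritten.

```python
import torch

# Toy input; the sigmoid below stands in for any op that saves its output
# for the backward pass.
x = torch.ones(3, requires_grad=True)

message = None
with torch.autograd.detect_anomaly():
    y = torch.sigmoid(x)  # sigmoid keeps its output to compute the gradient
    y *= 2                # in-place update overwrites that saved output
    try:
        y.sum().backward()
    except RuntimeError as err:
        message = str(err)

# The message names the tensor that was modified and the backward node
# (here the sigmoid) that still needed the original value.
print(message)
```

Anomaly detection slows training down noticeably, so it is usually switched on only while debugging.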

cgsavard avatar Sep 28 '22 23:09 cgsavard

I solved the issue by changing all the in-place operations here and here to their out-of-place equivalents. Essentially, var1 *= var2 became var1 = var1 * var2. Should this be changed in the code permanently to avoid the error in the future?
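To illustrate why the rewrite matters, here is a minimal sketch on a toy computation (the function names `scale_inplace` and `scale_out_of_place` are hypothetical and are not taken from the repository). `torch.exp` saves its output for the backward pass, so overwriting that output in place makes `backward()` fail, while the out-of-place form allocates a new tensor and leaves the saved one intact.

```python
import torch

def scale_inplace(x, w):
    y = torch.exp(x)  # exp saves its output for the backward pass
    y *= w            # in-place: overwrites the tensor autograd still needs
    return y.sum()

def scale_out_of_place(x, w):
    y = torch.exp(x)
    y = y * w         # out-of-place: new tensor, saved output is untouched
    return y.sum()

x = torch.tensor([0.0, 1.0, 2.0], requires_grad=True)
w = torch.tensor([2.0, 3.0, 4.0])

# The in-place version raises a RuntimeError during backward.
try:
    scale_inplace(x, w).backward()
    inplace_failed = False
except RuntimeError as err:
    inplace_failed = True
    print("in-place version failed:", str(err).splitlines()[0])

# The out-of-place version produces the expected gradient exp(x) * w.
x.grad = None
scale_out_of_place(x, w).backward()
print(x.grad)
```

Whether a given in-place operation fails depends on whether the preceding operation needs the overwritten tensor for its gradient, which is why the same code pattern can work in one place and break in another.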

cgsavard avatar Sep 30 '22 00:09 cgsavard

Hi @cgsavard -- can you share the PyTorch version? I can't seem to reproduce this error with, e.g., 1.12.1.
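A quick way to collect the relevant version details is sketched below; the `info` dictionary is just an illustrative container, and the GPU name is only queried when CUDA is available.

```python
import torch

# Gather the versions that matter for reproducing the issue.
info = {
    "torch": torch.__version__,          # PyTorch version string
    "cuda_build": torch.version.cuda,    # CUDA version PyTorch was built with (None for CPU builds)
    "cuda_available": torch.cuda.is_available(),
}
if info["cuda_available"]:
    info["gpu"] = torch.cuda.get_device_name(0)

for key, value in info.items():
    print(f"{key}: {value}")
```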

hqucms avatar Sep 30 '22 17:09 hqucms

Yes. This occurred after I installed PyTorch with "conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge", because I was working on a different GPU (K40) with a CUDA version newer than 11.6 (11.7). The PyTorch version was the stable 1.12.1, so I think it was actually the newer CUDA version that raised the issue.

cgsavard avatar Sep 30 '22 17:09 cgsavard

I tested CUDA 11.6 + PyTorch 1.12.1 and still cannot reproduce this error. Did you change anything else when the problem got solved?

hqucms avatar Sep 30 '22 18:09 hqucms