
checkpoint_wrapper at the first block disabling gradients

Open YingChenlu opened this issue 3 years ago • 1 comment

🐛 Bug

If I use checkpoint_wrapper on the first block of a model, the gradients of its parameters are None after backward.

Code sample

import torch
import torch.nn as nn

from fairseq.modules.checkpoint_activations import checkpoint_wrapper

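# Wrap the first block with activation checkpointing.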
conv = checkpoint_wrapper(nn.Sequential(
    nn.Conv1d(1, 1, 1),
    nn.Conv1d(1, 1, 1)
))
fc = nn.Linear(1, 1)

x = torch.randn(2, 1, 1)
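# (x.requires_grad is False by default; this is what triggers the bug.)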

conv.train()
fc.train()

conv.zero_grad()
fc.zero_grad()

print(f'before forward, conv[0].weight.grad is {conv[0].weight.grad}')
print(f'before forward, fc.weight.grad is {fc.weight.grad}')

y = conv(x)
z = fc(y)
loss = z.mean()
loss.backward()

print(f'after backward, conv[0].weight.grad is {conv[0].weight.grad}')
print(f'after backward, fc.weight.grad is {fc.weight.grad}')
  1. Save the code above to bug.py and run python bug.py.
  2. stdout shows:
before forward, conv[0].weight.grad is None
before forward, fc.weight.grad is None
after backward, conv[0].weight.grad is None
after backward, fc.weight.grad is tensor([[-0.1495]])

The exact value of fc.weight.grad may differ from run to run, but it is not None.

Expected behavior

The gradients of the conv block's parameters should not be None after backward.

Environment

  • fairseq Version (e.g., 1.0 or master): 1.0.0a0+366974d
  • PyTorch Version (e.g., 1.0): 1.7.1+cu92
  • OS (e.g., Linux): Linux ubuntu 4.15.0-76-generic
  • How you installed fairseq (pip, source): git clone the repo and then pip install inside the repo
  • Build command you used (if compiling from source): None
  • Python version: 3.8.5
  • CUDA/cuDNN version: Not used
  • GPU models and configuration: Not used

Although this can be avoided by setting input.requires_grad = True or by not using checkpoint_wrapper on the first block, I wonder how it happens.

YingChenlu · May 31 '21 08:05
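For context on why this happens: fairseq's checkpoint_wrapper follows the same design as torch.utils.checkpoint, where the recomputed forward runs inside a custom autograd Function. An autograd Function's output only requires grad if at least one of its tensor inputs does, and the wrapped block's parameters are captured by closure rather than passed as inputs. When the first block's input is a plain tensor with requires_grad=False, the checkpointed output therefore does not require grad, backward never re-enters the block, and its parameters receive no gradients. A minimal sketch of the workaround mentioned above (same imports as the repro; an illustration, not an officially endorsed fix):

import torch
import torch.nn as nn

from fairseq.modules.checkpoint_activations import checkpoint_wrapper

conv = checkpoint_wrapper(nn.Sequential(
    nn.Conv1d(1, 1, 1),
    nn.Conv1d(1, 1, 1)
))

# Marking the input as requiring grad connects the checkpointed block
# to the autograd graph, so its parameters receive gradients.
x = torch.randn(2, 1, 1).requires_grad_(True)

conv(x).mean().backward()
print(conv[0].weight.grad)  # now a tensor, not None

In practice, simply not checkpointing the first block is often the cleaner option, since requiring grad on the input also makes autograd track a (useless) gradient path back to the data.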

@YingChenlu I have a similar question: do you know how to freeze parameters of a model in fairseq during training? I tried both zero_grad and requires_grad = False, but neither works well…

robotsp · Sep 08 '22 14:09
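A note on the freezing question above: zero_grad does not freeze anything, it only clears gradients that have already been computed. Setting requires_grad = False on the parameters does stop new gradients from being computed, and it is common practice to also exclude those parameters from the optimizer. A minimal sketch of the standard PyTorch pattern (the two-layer model here is illustrative, not a real fairseq model):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Freeze the first layer: autograd will not compute gradients for it.
for p in model[0].parameters():
    p.requires_grad = False

# Pass only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1
)

x = torch.randn(8, 4)
model(x).sum().backward()
optimizer.step()

print(model[0].weight.grad)  # None: the frozen layer received no gradient
print(model[1].weight.grad)  # a tensor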