checkpoint_wrapper at the first block disabling gradients
🐛 Bug
If I wrap the first block with checkpoint_wrapper, the gradients of its parameters are None after backward.
Code sample
import os
import sys
import torch
import torch.nn as nn
from fairseq.modules.checkpoint_activations import checkpoint_wrapper
conv = checkpoint_wrapper(nn.Sequential(
    nn.Conv1d(1, 1, 1),
    nn.Conv1d(1, 1, 1)
))
fc = nn.Linear(1, 1)
x = torch.randn(2, 1, 1)
conv.train()
fc.train()
conv.zero_grad()
fc.zero_grad()
print(f'before forward, conv[0].weight.grad is {conv[0].weight.grad}')
print(f'before forward, fc.weight.grad is {fc.weight.grad}')
y = conv(x)
z = fc(y)
loss = z.mean()
loss.backward()
print(f'after backward, conv[0].weight.grad is {conv[0].weight.grad}')
print(f'after backward, fc.weight.grad is {fc.weight.grad}')
- save the code above as bug.py and run
python bug.py
- stdout shows
before forward, conv[0].weight.grad is None
before forward, fc.weight.grad is None
after backward, conv[0].weight.grad is None
after backward, fc.weight.grad is tensor([[-0.1495]])
The exact value of fc.weight.grad may differ between runs, but it is not None.
Expected behavior
Gradients of the conv block's parameters should not be None after backward.
Environment
- fairseq Version (e.g., 1.0 or master): 1.0.0a0+366974d
- PyTorch Version (e.g., 1.0): 1.7.1+cu92
- OS (e.g., Linux): Linux ubuntu 4.15.0-76-generic
- How you installed fairseq (pip, source): git cloned the repo and ran pip install inside it
- Build command you used (if compiling from source): None
- Python version: 3.8.5
- CUDA/cuDNN version: Not used
- GPU models and configuration: Not used
Although this can be avoided by setting requires_grad=True on the input,
or by not applying checkpoint_wrapper to the first block, I wonder why it happens.
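For reference, here is a minimal sketch of the two workarounds mentioned above, reusing the reproduction script. The comments reflect my understanding (not taken from the fairseq source) that activation checkpointing, like torch.utils.checkpoint, only rebuilds the wrapped block's graph during backward if at least one of its tensor inputs requires grad, since the block's parameters are not inputs of the underlying autograd Function.

import torch
import torch.nn as nn
from fairseq.modules.checkpoint_activations import checkpoint_wrapper

fc = nn.Linear(1, 1)

# Workaround 1: make the input require grad so that backward reaches the
# checkpointed segment and its parameter grads get populated.
conv = checkpoint_wrapper(nn.Sequential(
    nn.Conv1d(1, 1, 1),
    nn.Conv1d(1, 1, 1)
))
x = torch.randn(2, 1, 1).requires_grad_(True)
fc(conv(x)).mean().backward()
print(conv[0].weight.grad)  # no longer None

# Workaround 2: leave the first block unwrapped and checkpoint a later
# block, whose input already requires grad because it is produced by a
# trainable layer.
first = nn.Conv1d(1, 1, 1)
second = checkpoint_wrapper(nn.Sequential(nn.Conv1d(1, 1, 1)))
x = torch.randn(2, 1, 1)
fc(second(first(x))).mean().backward()
print(second[0].weight.grad)  # not None either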
@YingChenlu I have a similar question: do you know how to freeze parameters of a model in fairseq during training? I tried both zero_grad and requires_grad = False, but neither works well…
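Not a fairseq-specific answer, but the usual PyTorch recipe is to disable requires_grad on the frozen submodule's parameters and keep them out of the optimizer; zero_grad only clears already-computed gradients and does not stop optimizers with momentum or weight decay state from moving the weights. A minimal sketch with a made-up toy model (module names are for illustration only):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
frozen = model[0]  # placeholder: the submodule we want to freeze

# 1) Stop autograd from computing gradients for the frozen parameters.
for p in frozen.parameters():
    p.requires_grad_(False)

# 2) Hand only the trainable parameters to the optimizer, so momentum /
#    weight decay cannot update the frozen ones either.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()

print(frozen.weight.grad)            # None: no gradient was computed
print(model[2].weight.grad is None)  # False: trainable layer still gets grads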