
ChunkVideoSwin has no gradients

Open SimoLoca opened this issue 1 year ago • 7 comments

Hi, I have a question related to the training of TALLFormer. In particular, I noticed that when training the backbone, ChunkVideoSwin has its gradients set to None (I think this may lead to problems during the backward computation). Is this normal behaviour, or is something wrong?

To test this, I've inserted the following lines here: https://github.com/klauscc/TALLFormer/blob/5519140e39095cd87d9b50420bde912975cae9fb/vedatad/models/detectors/mem_single_stage_detector.py#L67

for name, param in self.backbone.named_parameters():
    print("name: ", name, "grad: ", param.grad)

SimoLoca avatar Apr 21 '23 14:04 SimoLoca

Hi SimoLoca, the gradients should be None during forward. You can only get gradients after loss.backward() and before optimizer.zero_grad(). If you want to see the gradients, you can insert your code right after L23: https://github.com/klauscc/TALLFormer/blob/5519140e39095cd87d9b50420bde912975cae9fb/vedacore/hooks/optimizer.py#L23
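
For illustration, here is a minimal, self-contained sketch of the .grad lifecycle in plain PyTorch (not TALLFormer code):

import torch

model = torch.nn.Linear(4, 2)
loss = model(torch.randn(3, 4)).sum()

print(model.weight.grad)            # None: backward has not run yet
loss.backward()
print(model.weight.grad is None)    # False: gradients are now populated
model.zero_grad(set_to_none=True)
print(model.weight.grad)            # None again after zero_grad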

klauscc avatar Apr 21 '23 17:04 klauscc

Hi @klauscc, thanks for the fast reply. I tried what you suggested: right after L23 I inserted:

for name, param in looper.train_engine.model.backbone.named_parameters():
    print("name: ", name, "grad: ", param.grad)
for name, param in looper.train_engine.model.neck.named_parameters():
    print("name: ", name, "grad: ", param.grad)
for name, param in looper.train_engine.model.head.named_parameters():
    print("name: ", name, "grad: ", param.grad)

Interestingly, the backbone has all gradients set to None, while the neck and head have gradients. Is this behaviour correct? And lastly, does this mean that the backbone is frozen during training, and if so, how can I "unfreeze" it? Thanks so much!

SimoLoca avatar Apr 22 '23 15:04 SimoLoca

Hi @SimoLoca, I did a quick check and the backbone is indeed updated during training:

>>> import torch
>>> s1 = torch.load('epoch_600_weights.pth',map_location="cpu")
>>> s2 = torch.load('epoch_1000_weights.pth',map_location="cpu")
>>> w1 = s1['backbone.layers.2.blocks.16.mlp.fc2.bias']
>>> w2 = s2['backbone.layers.2.blocks.16.mlp.fc2.bias']
>>> torch.allclose(w1,w2)
False
>>> w1[:10]
tensor([ 0.0496,  0.0174,  0.0173, -0.1023,  0.0316,  0.8908, -0.1456, -0.1831,
        -0.3061, -0.3634])
>>> w2[:10]
tensor([ 0.0492,  0.0165,  0.0165, -0.1018,  0.0315,  0.8822, -0.1449, -0.1810,
        -0.3043, -0.3599])
>>>

In the config file: https://github.com/klauscc/TALLFormer/blob/main/configs/trainval/thumos/1.0.0-vswin_b_256x256-12GB.py#L99 the first two stages of the backbone are frozen. Swin-B has 24 blocks; we only tune the last 20 (the last two stages). Did you only check the first several parameters?
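
As a sanity check (a generic sketch; the exact parameter names depend on the Swin implementation), you can list which backbone parameters actually require gradients. To unfreeze more of the backbone, you would reduce the number of frozen stages in the backbone config (e.g. the frozen_stages option in Video Swin):

# Generic sketch: count trainable vs. frozen backbone parameters.
frozen, trainable = [], []
for name, param in model.backbone.named_parameters():
    (trainable if param.requires_grad else frozen).append(name)
print(f"frozen: {len(frozen)}  trainable: {len(trainable)}")

# With the first two stages frozen, only parameters under
# backbone.layers.2 and backbone.layers.3 should appear in `trainable`.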

klauscc avatar Apr 23 '23 18:04 klauscc

I checked that by following the README for training the model, without loading a checkpoint, and I did not change the config file. Is my way of checking whether the backbone's weights are updated wrong?

SimoLoca avatar Apr 23 '23 19:04 SimoLoca

Hi @klauscc, I've resolved the issue. There is no problem with the code; there were some errors in my config file, so forgive me if I disturbed you too much. Just one last question: during the feature extraction phase, might it make sense to use a stride? For example, with a stride of 16, processing frames 0-32, then 16-48, and so on?

Thank you so much!

SimoLoca avatar Apr 27 '23 15:04 SimoLoca

It's great you figured it out! Yes, I believe extracting features with a stride may lead to higher performance, but the computational cost will increase, and you would need to make some changes to the backbone code to process frames that way.
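
As a rough illustration of the idea (a generic sketch, not TALLFormer code; extract_chunk is a hypothetical stand-in for the backbone's per-chunk forward, frames is assumed to be a float tensor, and overlapping features are simply averaged, which is only one possible merge strategy):

import torch

def strided_features(frames, extract_chunk, window=32, stride=16):
    """frames: (T, ...) tensor; extract_chunk returns a (window, C) feature
    per clip. Windows cover frames 0-32, 16-48, ...; overlaps are averaged."""
    T = frames.shape[0]
    feat_sum, count = None, None
    for start in range(0, T - window + 1, stride):
        f = extract_chunk(frames[start:start + window])
        if feat_sum is None:
            feat_sum = frames.new_zeros(T, f.shape[-1])
            count = frames.new_zeros(T, 1)
        feat_sum[start:start + window] += f
        count[start:start + window] += 1
    return feat_sum / count.clamp(min=1)  # uncovered tail frames stay zero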

klauscc avatar May 05 '23 21:05 klauscc

OK, thank you. So do I need to make the changes in SwinTransformer3D or in ChunkVideoSwin?

SimoLoca avatar May 08 '23 08:05 SimoLoca