
Discussion on training issues I have encountered

zmy1116 opened this issue 3 years ago · 33 comments

Thank you for the implementation of the paper. This is the first time I'm dealing with a transformer model; I tried to train it on the Kinetics-700 dataset, and I just want to share some of the issues I have encountered:

The paper suggests that the model works better with pretrained weights. Since this is a direct extension of the image transformer, most of the vision transformer's weights should apply directly, but there are 2 places that differ:

  1. Positional encoding: we have H x W x T instead of H x W, so I copied the same positional encoding to every frame, sort of like how we inflate ImageNet weights for I3D, but without dividing by T. One alternative I'm considering is to use angular initialization to generate a 1 x T positional encoding and add it to the H x W image positional encoding to form the H x W x T positional encoding.
  2. We are now doing two self-attentions per block instead of one, so there are twice as many weights for the qkv and output FC layers. For now, if I use the same number of heads as the pretrained image model, I use the same weights for the first and second self-attention in each block. Alternatively, in a different model I use half the number of heads, so the time attention and spatial attention each take half of the head weights. (A rough sketch of both adaptations follows this list.)
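
Roughly, the two adaptations look like this (a sketch; the target key names such as pos_embed, space_attn and time_attn are placeholders, not this repo's actual parameter names):

    import torch

    def inflate_vit_weights(vit_state, num_frames):
        # Adapt a pretrained ViT state dict for a space/time transformer.
        out = {}

        # 1. Positional encoding: reuse the H*W spatial embedding for every frame
        #    (like inflating ImageNet weights for I3D, without dividing by T).
        pos = vit_state['pos_embed']                      # (1, 1 + H*W, C)
        cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
        out['pos_embed'] = torch.cat(
            [cls_pos, patch_pos.repeat(1, num_frames, 1)], dim=1)  # (1, 1 + T*H*W, C)

        # 2. Two self-attentions per block: start both from the same pretrained weights.
        for k, v in vit_state.items():
            if '.attn.' in k:
                out[k.replace('.attn.', '.space_attn.')] = v.clone()
                out[k.replace('.attn.', '.time_attn.')] = v.clone()
        return out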

Since it is the first time I'm dealing with a Transformer, I wanted to reproduce what the paper claimed, so I started with the "original" basic vision transformer setup:

  • 12 heads 12 blocks
  • GELU instead of GEGLU
  • embedding size 768
  • Image size 224, divide to 16x16 patches

With this setup, on a V100 GPU we can only squeeze in 4 videos (4x8x3x224x224) for training even with torch.amp. This means that if I run an experiment on a p3.8xlarge machine with 4 V100 GPUs (~$12/h on-demand), it would take 39 days to do 300 epochs. Of course it may not need 300 epochs, but intuitively, training with batch size = 16 is usually not optimal.

So alternatively, I tried a new model with 6 heads and 8 blocks. Now I can fit 16 videos per GPU, for a total batch size of 64. The model started training smoothly, but then the training error increased after 7-8 epochs. The training accuracy peaked around 55%, and I didn't even bother to run validation because it clearly wasn't working. Below is the relevant configuration I was using.

DATA:
  NUM_FRAMES: 8
  SAMPLING_RATE: 16
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  # TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 1
  TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 3
  INPUT_CHANNEL_NUM: [3]
  DECODING_BACKEND: torchvision
  MEAN: [0.5, 0.5, 0.5]
  STD: [0.5, 0.5, 0.5]
  WEIGHT_DECAY: 0.0
SOLVER:
  BASE_LR: 0.1 # 1 machine
  BASE_LR_SCALE_NUM_SHARDS: True
  LR_POLICY: cosine
  MAX_EPOCH: 300
  WEIGHT_DECAY: 5e-5
  WARMUP_EPOCHS: 35.0
  WARMUP_START_LR: 0.01
  OPTIMIZING_METHOD: sgd
TRANSFORMER:
  TOKEN_DIM: 768
  PATCH_SIZE: 16
  DEPTH: 8
  HEADS: 6
  HEAD_DIM: 64
  FF_DROPOUT: 0.1
  ATTN_DROPOUT: 0.0

So these are the issues I have encountered so far. I want to share them because hopefully some of you are actually working with video models and we can have a discussion. My next thing to try is probably to increase the depth.

Regards

zmy1116 avatar Mar 26 '21 21:03 zmy1116

The original paper says they use a pre-trained ViT, and I think training on K400 or K700 without pre-trained weights is not a good choice. I also want to reproduce the results of this work, but there are many details I can't handle.

I did use pretrained weights from ViT

zmy1116 avatar Mar 27 '21 14:03 zmy1116

Hi, I also ran into the training-accuracy problem. I train the model using pretrained weights from ViT, but the accuracy is lower than a baseline model (ViT backbone only, without time attention). I train on 4 V100 GPUs, so the batch size can be set to 4 x 8 = 32. I randomly initialize the weights of the model that are not contained in ViT (HWT position embedding, time attention).

Hanqer avatar Mar 29 '21 09:03 Hanqer

I got lower train/val performance too.

Tonyfy avatar Apr 01 '21 09:04 Tonyfy

@zmy1116 , @Hanqer Can I ask how you two got the pretrained ImageNet weights from ViT? I am training this repo's implementation for a dataset of mine too, but as far as I can tell, there are no pretrained weights available for this repo. Am I wrong?

RaivoKoot avatar Apr 03 '21 17:04 RaivoKoot

@zmy1116 , @Hanqer Can I ask how you two got the pretrained ImageNet weights from ViT? I am training this repo's implementation for a dataset of mine too, but as far as I can tell, there are no pretrained weights available for this repo. Am I wrong?

https://github.com/google-research/vision_transformer

zmy1116 avatar Apr 03 '21 19:04 zmy1116

@zmy1116 perhaps you could share/PR the code you used for initializing with pre-trained ViT?

mckinziebrandon avatar Apr 04 '21 20:04 mckinziebrandon

@zmy1116 perhaps you could share/PR the code you used for initializing with pre-trained ViT?

https://arxiv.org/pdf/2103.15691.pdf This is Google's similar paper (ViViT); their Model 3 is exactly TimeSformer. Initialize the weights based on Section 3.4.

zmy1116 avatar Apr 04 '21 20:04 zmy1116

Yes, I understand that. However, initializing TimeSformer with ViT weights does require some additional code, primarily for mapping the pre-trained weight names to the TimeSformer weight names. If you've already done this, it would be nice of you to share so we can avoid reinventing the wheel.

I'm doing it right now, and will post what I do if you don't want to share.

mckinziebrandon avatar Apr 04 '21 21:04 mckinziebrandon

Yes, I understand that. However, initializing TimeSformer with ViT weights does require some additional code, primarily for mapping the pre-trained weight names to the TimeSformer weight names. If you've already done this, it would be nice of you to share so we can avoid reinventing the wheel.

I'm doing it right now, and will post what I do if you don't want to share.

please do

zmy1116 avatar Apr 04 '21 21:04 zmy1116

Hm, posts an issue requesting help from others on an open-source project, but won't contribute/share code themselves. Fascinating!

mckinziebrandon avatar Apr 04 '21 21:04 mckinziebrandon

Hm, posts an issue requesting help from others on an open-source project, but won't contribute/share code themselves. Fascinating!

"Won't contribute" seems harsh. I did share some of my findings: I shared where the original ViT weights are, I read multiple papers, and I shared the Google paper and pointed to the section on weight initialization. I would think that counts for something. I'd like to think I provided more information than most people who post issues in this repo.

zmy1116 avatar Apr 04 '21 21:04 zmy1116

I'm doing it right now, and will post what I do if you don't want to share.

@mckinziebrandon That would be great, thanks! I've also found a similar implementation which says it includes pretrained weights, but I have not yet tested it myself to see if it works as expected (planning to soon): https://github.com/m-bain/video-transformers

RaivoKoot avatar Apr 05 '21 00:04 RaivoKoot

I'm still very new to the ML space, but would like to contribute as well. I'd be happy to contribute some documentation if we could hop on a Discord/Slack chat and discuss. Really interested in seeing how this model performs and how I can apply it to some videos. Thanks!

justinrmiller avatar Apr 05 '21 22:04 justinrmiller

@Hanqer @Tonyfy

With multiple rounds of changes and testing, I am able to reproduce similar (not better) results on Kinetics700_2020 with a video transformer.

I did the following:

  • Based on Google's paper https://arxiv.org/pdf/2103.15691.pdf, I implemented their Model 2 (their Model 3 is exactly TimeSformer, while Model 2 splits spatial attention and time attention into 2 separate stages). They claim that doing this is better, and it is also a smaller model, so we can fit a batch of 64 with 8 V100s. The model modification is very straightforward based on this repo.

  • It is important to find a good learning rate, and I have to do learning rate warmup, otherwise training diverges from the very beginning. For Kinetics I use the following:

SOLVER:
  BASE_LR: 0.05 # 1 machine
  BASE_LR_SCALE_NUM_SHARDS: True
  LR_POLICY: cosine
  MAX_EPOCH: 30
  WEIGHT_DECAY: 5e-5
  WARMUP_EPOCHS: 1.0
  WARMUP_START_LR: 0.01
  OPTIMIZING_METHOD: sgd
  • I included color jittering in augmentation (roughly the snippet below); not sure how much this helped
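
The jittering I added is nothing fancy; roughly (the magnitudes here are illustrative, not my exact config values):

    import torchvision.transforms as T

    # Applied per frame, before normalization.
    color_jitter = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)
    # frames = [color_jitter(frame) for frame in frames]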

After 30 epochs I'm getting ~62% accuracy on Kinetics700_2020 with multi-view testing. My best model on this dataset (X3D-M) was ~63.4% with multi-views. I don't think it's great, but I don't see any results for this dataset online. The closest public model I could find is from SenseTime's lab on K700; they get 64% with multi-views.

So I would say that with a video transformer I can get a reasonable model, and the training time is around 30h on an 8-GPU machine, which I find very interesting.

zmy1116 avatar Apr 06 '21 03:04 zmy1116

@zmy1116 Thanks for sharing the training configuration. I do the same thing: finding a good learning rate with warmup. In my setting, I use lr=0.01 and warm up for 2 epochs, Model 3 (TimeSformer). But on Kinetics400 I can't get reasonable results. I will test the configuration you shared. By the way, is your model using 8-frame sampling with sampling rate x32? Have you trained on Kinetics400?

Hanqer avatar Apr 06 '21 05:04 Hanqer

[...] By the way, is your model using 8-frame sampling with sampling rate x32? Have you trained on Kinetics400?

num_frames = 32
sampling_rate = 2
target_fps = 30

Also, I used the tubelet embedding as the paper suggested. I use tubelets of size 16x4 (a 16x16 spatial patch over 4 frames), so 32x224x224 inputs become 8x14x14 tokens.
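
The tubelet embedding itself is just a 3D convolution; roughly (a sketch, layer names are mine):

    import torch
    import torch.nn as nn

    # A 16x16 spatial patch over 4 frames becomes one token:
    # (B, 3, 32, 224, 224) -> (B, 768, 8, 14, 14) -> (B, 8*14*14, 768)
    tubelet_embed = nn.Conv3d(3, 768, kernel_size=(4, 16, 16), stride=(4, 16, 16))

    x = torch.randn(2, 3, 32, 224, 224)
    tokens = tubelet_embed(x).flatten(2).transpose(1, 2)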

No, I didn't try K400... I realized a long time ago that since the papers are evaluated on K400, I should do these experiments on K400... but it's a huge pain to download the dataset (need to launch multiple machines to download, rotate proxies so I don't get banned, verify corrupted downloads, etc.)...

zmy1116 avatar Apr 06 '21 06:04 zmy1116

Thanks for @zmy1116 's sharing. I will share the results on kinetics400 after testing this configuration.

Hanqer avatar Apr 06 '21 06:04 Hanqer

Thanks @zmy1116. I was able to load the pretrained ViT weights into TimeSformer with the following modifications.

  • Replace GEGLU with nn.GELU in the FeedForward implementation.
  • Replace to_patch_embedding with the PatchEmbed class from timm and rename it to patch_embed.
  • Minor reshaping tweaks like self.cls_token = nn.Parameter(torch.randn(1, 1, dim)) (instead of (1, dim)) for compatibility with the ViT model weights.

I used the following regex mapping to go from the ViT weight names to the TimeSformer names:

    mapping = {
        'cls_token': 'timesformer.cls_token',
        r'patch_embed\.(.*)': r'timesformer.patch_embed.\1',  # raw strings so \1 stays a backreference
        r'blocks\.(\d+).norm1\.(.*)': r'timesformer.layers.\1.1.norm.\2',
        r'blocks\.(\d+).norm2\.(.*)': r'timesformer.layers.\1.2.norm.\2',
        r'blocks\.(\d+).attn\.qkv\.weight': r'timesformer.layers.\1.1.fn.to_qkv.weight',
        r'blocks\.(\d+).attn\.proj\.(.*)': r'timesformer.layers.\1.1.fn.to_out.0.\2',
        r'blocks\.(\d+).mlp\.fc1\.(.*)': r'timesformer.layers.\1.2.fn.net.0.\2',
        r'blocks\.(\d+).mlp\.fc2\.(.*)': r'timesformer.layers.\1.2.fn.net.3.\2',
    }
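
Applying that mapping is then just a loop over the ViT state dict; roughly (a sketch, assuming the patterns above are raw strings):

    import re

    def remap_vit_state_dict(vit_state_dict, mapping):
        # Rename pretrained ViT keys to the TimeSformer names using the regexes above.
        new_state_dict = {}
        for name, tensor in vit_state_dict.items():
            for pattern, replacement in mapping.items():
                if re.fullmatch(pattern, name):
                    new_state_dict[re.sub(pattern, replacement, name)] = tensor
                    break
        return new_state_dict

    # timesformer.load_state_dict(remap_vit_state_dict(vit.state_dict(), mapping), strict=False)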

The pretrained model I used was obtained through the timm library via:

vit_base_patch32_224_in21k = timm.create_model(
    'vit_base_patch32_224_in21k',
    pretrained=True)

Finally, I also tried initializing the temporal attention submodule's weights to zeros, as recommended by the ViViT paper:

    def zero(m):
        if hasattr(m, 'weight') and m.weight is not None:
            nn.init.zeros_(m.weight)
        if hasattr(m, 'bias') and m.bias is not None:
            nn.init.zeros_(m.bias)

    for layer in self.timesformer.layers:
        prenorm_temporal_attn: nn.Module = layer[0]
        prenorm_temporal_attn.apply(zero)

Note: I'm using an internal framework so a full copy/paste of my code wouldn't make sense to anyone, but the above description is everything I've tried so far. Still need to tweak/debug more though. After 80 epochs I'm still only getting about 55% validation accuracy on Kinetics-400 (@Hanqer), compared to the 40% I was getting without using pretrained ViT weights.

Also, FWIW I am able to overfit the training data quite easily (no surprise there) and reach nearly 100 percent training accuracy with enough epochs.

mckinziebrandon avatar Apr 06 '21 14:04 mckinziebrandon

@zmy1116 have you tried using adamw instead of SGD? I'm currently trying variants of

--batch-size                    = 32
--optimizer                     = adamw
--learning-rate                 = 1.25e-4 # this gets scaled by number of replicas in my code
--w-decay                       = 0.1
--num-epochs                    = 100
--clip-gradient-norm            = 2.5

# Warmup schedule
--lr-schedule-logic             = cosine-with-warmup
--lr-end-lr                     = 0.0
--lr-max-epoch                  = 100
--lr-warm-epoch                 = 30
--lr-start-lr                   = 1e-7

For the data, I'm doing the same processing as X3D_M.yaml (it appears your configs are set up like this). A typical GPU setup is 2 machines with 8 V100 GPUs each.

mckinziebrandon avatar Apr 06 '21 14:04 mckinziebrandon

@zmy1116 have you tried using adamw instead of SGD? [...]

I was not able to get good training with adam/adamw... but I have not tested many configurations because of limited resources.

zmy1116 avatar Apr 06 '21 17:04 zmy1116

One problem I am trying to figure out is test time inference.

With X3D/SlowFast/I3D, they do 30 crops:

  • Do 10 uniform temporal crops
  • Per temporal crop, generate 3 spatial crops: first resize the shorter side to 256, then crop 256x256 from the left, middle, and right

Even though they train on 224x224, at inference they can still do 256x256 because it's a conv net, and with 3 crops they cover the entire image.

Now with the transformer, I train with 224x224 and we can't feed in 256x256 inputs at inference time (unless we modify the network). So for the spatial crops I can:

  1. resize the shorter side to 224 and do 3 crops; the 3 crops will cover the full image
  2. resize the shorter side to 256 and do 3 crops: top left, middle, bottom right

You wouldn't think it matters, but to my surprise, the second method actually produces better results, even though we cover more of the image with the first method. I guess it's because during training I resize the shorter side to between 256 and 320, so the model is more used to a certain resolution? (Concretely, the second method is just the sketch below.)
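
A rough sketch of the second method (assuming frames are a (T, C, H, W) tensor; not my exact code):

    import torchvision.transforms.functional as TF

    def three_crops_224(frames):
        # Option 2: resize the shorter side to 256, then take three 224x224
        # crops along the longer side (top-left, center, bottom-right).
        frames = TF.resize(frames, 256)          # shorter side -> 256
        _, _, h, w = frames.shape
        th = tw = 224
        if w >= h:
            offsets = [(0, 0), (0, (w - tw) // 2), (0, w - tw)]
        else:
            offsets = [(0, 0), ((h - th) // 2, 0), (h - th, 0)]
        return [TF.crop(frames, top, left, th, tw) for top, left in offsets]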

Alternatively, I'm thinking about whether I should modify the network at test time so the model can take 256x256 inputs:

  • 256x256 produces a 16x16 grid of size-16 tokens instead of 14x14, so for the two extra rows/columns of tokens I would just copy the boundary tokens' weights (embedding, positional embedding, spatial-transformer-related parameters)

This is the first time I've worked on transformers... I think it intuitively makes sense, but I'm not sure...

zmy1116 avatar Apr 06 '21 17:04 zmy1116

@mckinziebrandon @Hanqer

It looks like in the original vision transformer, a dropout is applied immediately after the embedding:

    x = AddPositionEmbs(
        inputs,
        inputs_positions=inputs_positions,
        posemb_init=nn.initializers.normal(stddev=0.02),  # from BERT.
        name='posembed_input')
    x = nn.dropout(x, rate=dropout_rate, deterministic=not train)

In this implementation we don't. I'm probably chasing rabbits... I don't think it really matters, but since I can't produce better results, I'm looking at everything that may cause the discrepancy.
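
For reference, the PyTorch equivalent would be something like this (a sketch, not this repo's actual code):

    import torch
    import torch.nn as nn

    class PosEmbedWithDropout(nn.Module):
        # Sketch: add the positional embedding, then apply dropout immediately,
        # as the reference ViT does.
        def __init__(self, num_tokens, dim, dropout_rate=0.1):
            super().__init__()
            self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
            nn.init.normal_(self.pos_embed, std=0.02)  # from BERT, as in the Flax code
            self.drop = nn.Dropout(dropout_rate)

        def forward(self, tokens):
            return self.drop(tokens + self.pos_embed)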

zmy1116 avatar Apr 06 '21 18:04 zmy1116

Another maybe irrelevant comment:

I'm starting to examine attention weights using https://www.kaggle.com/piantic/vision-transformer-vit-visualize-attention-map

Some examples of spatial attention maps: [attention-map images]

Some examples of temporal attention (lighter frames represent higher-attention frames): [attention-map images]

zmy1116 avatar Apr 08 '21 04:04 zmy1116

@zmy1116 Hi, as for the three-crop testing you mentioned above: I think the ConvNet ultimately also takes 224x224 input in the training phase. So could you forward the transformer with:

  • resize the shorter side to 256, do 3 crops (top left, middle, bottom right), and resize these crops to 224x224?

And for 256x256 input to the transformer, the model actually has to be modified, especially the positional embedding (it needs an interpolation to fit the different number of tokens).

Hanqer avatar Apr 08 '21 04:04 Hanqer

@zmy1116 Hi, as for the three-crop testing you mentioned above: I think the ConvNet ultimately also takes 224x224 input in the training phase. So could you forward the transformer with:

  • resize the shorter side to 256, do 3 crops (top left, middle, bottom right), and resize these crops to 224x224?

And for 256x256 input to the transformer, the model actually has to be modified, especially the positional embedding (it needs an interpolation to fit the different number of tokens).

I tried to run the transformer trained on 224x224 with 256x256 inputs by extending the number of positional embeddings (so instead of 14x14 tokens I now have 16x16, with one extra ring of positional embeddings around the edge, which I interpolate with the nearest-neighbor value). The performance is slightly worse (-0.5%) and the computation time increased a lot.
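
For reference, the kind of extension I mean looks roughly like this (a sketch, not my exact code):

    import torch
    import torch.nn.functional as F

    def resize_spatial_pos_embed(pos_embed, old_grid=14, new_grid=16, mode='nearest'):
        # pos_embed: (1, 1 + old_grid**2, C) with a leading cls token.
        # Returns (1, 1 + new_grid**2, C) by resampling the spatial grid.
        cls_tok, grid = pos_embed[:, :1], pos_embed[:, 1:]
        c = grid.shape[-1]
        grid = grid.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)  # (1, C, 14, 14)
        grid = F.interpolate(grid, size=(new_grid, new_grid), mode=mode)
        grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)
        return torch.cat([cls_tok, grid], dim=1)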

zmy1116 avatar Apr 08 '21 13:04 zmy1116

Interesting attention plots @zmy1116. I've been tracking plots of gradient norms and the associated weight histograms, and I've observed something odd that appears to be caused by zeroing out the temporal attention block during initialization (as recommended for ViViT Model 3): the gradients are extremely low and the weights simply do not budge from their initialized values.

For example, here is the histogram (x-axis is time) of the layer norm and qkv weights: [image]

I've also tried initializing only the attention weights themselves with zeros (qkv and to_out), but that does not resolve the issue. I'm now able to get quite good validation accuracy on Kinetics-400 compared to before (~70%), but it appears the model is basically ignoring the temporal attention block. Have you looked at your weight/gradient histograms @zmy1116?

Also note that my 70% validation accuracy is single-crop. No aggregation.
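
In case it's useful to anyone hitting the same thing, the check itself is cheap to add after loss.backward(); roughly (the 'layers.N.0' prefix matches how my temporal sublayers are indexed, adjust for your own module names):

    def grad_norms_by_prefix(model, prefix='timesformer.layers.0.0'):
        # Gradient norms of the (zero-initialized) temporal attention sublayer,
        # to see whether it is getting any signal at all.
        return {
            name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None and name.startswith(prefix)
        }

    # after loss.backward():
    # print(grad_norms_by_prefix(model))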

mckinziebrandon avatar Apr 08 '21 14:04 mckinziebrandon

For moving to 256x256, have you seen this snippet in the timm repo for ViT @zmy1116? https://github.com/rwightman/pytorch-image-models/blob/779107b693010934ac87c8cecbeb65796e218488/timm/models/vision_transformer.py#L386

mckinziebrandon avatar Apr 08 '21 14:04 mckinziebrandon

@mckinziebrandon, ehm... below are the per-weight distributions (in absolute value) for the temporal-related weights. Indeed, they are definitely smaller than the spatial-related weights: [weight-distribution images]

zmy1116 avatar Apr 08 '21 17:04 zmy1116

Ah, is this TimeSformer or Model 2 of ViViT? I'm confused by your naming distinction between "temporal_encoder_layers" and "spatial_encoder_layers".

Mine is just the single TimeSformer model, where layers.\d.0 are the temporal attention sublayers and layers.\d.1 are the spatial attention sublayers.

mckinziebrandon avatar Apr 08 '21 17:04 mckinziebrandon

Ah, is this TimeSformer or Model 2 of ViViT? I'm confused by your naming distinction between "temporal_encoder_layers" and "spatial_encoder_layers".

This is Model 2, as that is the one I was able to get good results with... I have never succeeded with TimeSformer.

zmy1116 avatar Apr 08 '21 17:04 zmy1116