
Why eval mode degenerated?

Open WUSHUANGPPP opened this issue 3 years ago • 6 comments

Thanks for your clean implementation! @kkahatapitiya I have two questions for you:

  1. After finishing training X3D on the Kinetics-200 dataset, the prediction in eval mode is always the same, but inference is normal if I keep the model in model.train(). I failed to find the reason. (I observed the same thing with base_bn_splits=8 and base_bn_splits=1; I trained the model in the normal way.)
  2. Why do some layerx.x.bnx.split_bn.running_var and running_mean buffers stay constant throughout the whole training process? [image: chart of running_mean/running_var over training] As the chart above shows, why do running_mean and running_var keep the same values along the whole training process? Appreciate it.

WUSHUANGPPP avatar Oct 11 '21 16:10 WUSHUANGPPP

During training, the split_bn statistics (e.g., self.split_bn.running_mean.data) inside SubBatchNorm are updated, and they are copied into the bn statistics (e.g., self.bn.running_mean.data) before eval, by running https://github.com/kkahatapitiya/X3D-Multigrid/blob/d63d8fe6210d2b38aa26d71b0062b569687d6be2/train_x3d_kinetics_multigrid.py#L205

Are you doing this? If so, things should work properly. Also, what batch size per GPU and how many BN splits are you using?

kkahatapitiya avatar Oct 11 '21 16:10 kkahatapitiya
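The aggregation the maintainer describes folds the per-split BN statistics into one set of statistics via the law of total variance. A minimal plain-Python sketch of that per-channel folding step (the function name and the equal-sized-split assumption are illustrative, not taken from the repo):

```python
def aggregate_stats(split_means, split_vars):
    """Fold per-split (population) mean/variance into one aggregate
    mean/variance, assuming all splits saw equally many samples.

    Law of total variance: total var = E[per-split var]
                                     + Var[per-split means].
    """
    k = len(split_means)
    mean = sum(split_means) / k
    var = (sum(split_vars) / k
           + sum((m - mean) ** 2 for m in split_means) / k)
    return mean, var
```

For equal-sized splits this reproduces exactly the mean and (population) variance of the concatenated data, which is why copying the aggregated values into `bn.running_mean` / `bn.running_var` gives correct eval-mode normalization.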

Appreciate it. I left out the code you mentioned. I just use your x3d.py and train it as a common video classification task in my project code, with a total batch size of 128 across 8 GPUs (16 per GPU), without setting any multigrid training details.

# ...other backbones...
elif opt.model == 'x3d':
    model = x3d3.generate_model('M', n_classes=opt.classes)
# ...other backbones...

I just use the generate_model(x3d_version, **kwargs) interface to build the X3D model, and then I'd like to modify the backbone to check other training tricks.

WUSHUANGPPP avatar Oct 12 '21 03:10 WUSHUANGPPP

So each epoch we have to run x3d.module.aggregate_sub_bn_stats(), otherwise the bn statistics stay at their initial values? @kkahatapitiya Is there any other configuration like this?

WUSHUANGPPP avatar Oct 12 '21 04:10 WUSHUANGPPP

You have to run aggregate_sub_bn_stats() before validation (i.e., every time you put the model in eval() mode).

kkahatapitiya avatar Oct 12 '21 05:10 kkahatapitiya
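A sketch of where that call belongs in a training loop. `DummyModel`, `train_one_epoch`, and `validate` below are placeholder stand-ins (not from the repo) so the snippet is self-contained; only the placement of `aggregate_sub_bn_stats()` relative to `train()`/`eval()` is the point:

```python
def train_one_epoch(model):
    """Placeholder for the user's own training step."""
    pass


def validate(model):
    """Placeholder for the user's own validation step."""
    pass


class DummyModel:
    """Stand-in exposing the same method name as X3D-Multigrid's model.

    Records the mode it was in when aggregation was called, so the
    ordering can be checked.
    """
    def __init__(self):
        self.aggregated_in = []

    def train(self):
        self.mode = "train"

    def eval(self):
        self.mode = "eval"

    def aggregate_sub_bn_stats(self):
        self.aggregated_in.append(self.mode)


def run(model, num_epochs=3):
    for epoch in range(num_epochs):
        model.train()
        train_one_epoch(model)

        # Fold split-BN statistics into the main BN buffers BEFORE
        # switching to eval; otherwise bn.running_mean/running_var keep
        # their initial values and eval predictions degenerate.
        # (With DataParallel this is model.module.aggregate_sub_bn_stats().)
        model.aggregate_sub_bn_stats()

        model.eval()
        validate(model)
    return model
```

Skipping the aggregation call is exactly what produced the constant eval-mode predictions and the flat running_mean/running_var curves reported above.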

Hi @kkahatapitiya, could you please tell me whether you tested performance in the normal training setup (not in multigrid training mode)? I trained and tested with a constant batch size of 128 for 350 epochs on Kinetics-200 (a smaller 200-class dataset, which should yield higher accuracy), and got 64.0% accuracy, which is similar to ResNet-18's performance on this dataset. (I ran aggregate_sub_bn_stats() each epoch and skipped validation for faster training.) Initial lr: 0.05; schedule: cosine decay.

WUSHUANGPPP avatar Oct 13 '21 07:10 WUSHUANGPPP

Sorry about the long delay in responding. Since the data split and several training hyperparameters are different, I am not sure what the expected performance would look like. If you train with the given hyperparameters on the default K400 split, you'll get a number closer to what's reported.

kkahatapitiya avatar Jul 07 '23 15:07 kkahatapitiya