X3D-Multigrid
Why does eval mode degenerate?
Thanks for your clean implementation! @kkahatapitiya I have two questions to consult you about:
- I find that the predictions in eval mode are always the same after I finish training X3D on the Kinetics-200 dataset, but inference is normal in model.train() mode. I could not find the reason. (base_bn_splits=8 or 1 gives the same observation; I trained the model in the normal way.)
- Why do some layerx.x.bnx.split_bn.running_var and running_mean stay unchanged throughout the whole training process?

As the chart above shows, why do running_mean and running_var stay the same throughout training? I'd appreciate any help.
During training, the split_bn statistics (e.g., self.split_bn.running_mean.data) inside SubBatchNorm are updated, and they are copied to the bn statistics (e.g., self.bn.running_mean.data) before eval, by running https://github.com/kkahatapitiya/X3D-Multigrid/blob/d63d8fe6210d2b38aa26d71b0062b569687d6be2/train_x3d_kinetics_multigrid.py#L205
Are you doing this? If so, things should work properly. Also, what is the batch size per GPU, and how many splits do you use for BN?
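For intuition, here is a minimal sketch of the SlowFast-style SubBatchNorm design this repo follows. It is not the repo's exact code: the class name, the omitted affine parameters, and other bookkeeping are my simplifications.

```python
import torch
import torch.nn as nn

class SubBatchNorm3dSketch(nn.Module):
    def __init__(self, num_splits, num_features):
        super().__init__()
        self.num_splits = num_splits
        # bn is used in eval mode; split_bn is used in train mode, where the
        # batch is reshaped so each split normalizes batch_size / num_splits
        # samples independently.
        self.bn = nn.BatchNorm3d(num_features, affine=False)
        self.split_bn = nn.BatchNorm3d(num_features * num_splits, affine=False)

    def aggregate_stats(self):
        # Fold the per-split running stats into the eval-mode BN.
        m = self.split_bn.running_mean.view(self.num_splits, -1)
        v = self.split_bn.running_var.view(self.num_splits, -1)
        mean = m.mean(0)
        # Law of total variance: average within-split variance plus the
        # variance of the split means around the aggregated mean.
        var = v.mean(0) + ((m - mean) ** 2).mean(0)
        self.bn.running_mean.data.copy_(mean)
        self.bn.running_var.data.copy_(var)

    def forward(self, x):
        if self.training:
            n, c, t, h, w = x.shape
            x = x.view(n // self.num_splits, c * self.num_splits, t, h, w)
            x = self.split_bn(x)
            return x.view(n, c, t, h, w)
        return self.bn(x)
```

If bn.running_mean/running_var never change while the split_bn ones do, that means the aggregation step (exposed as aggregate_sub_bn_stats() on the full model) is never being called, which matches what you observe.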
Appreciate it. I left out the code you mentioned. I just use your x3d.py and train it as a common video classification task in my project code, with a batch size of 128 on 8 GPUs (batch size 16 per GPU), without setting any multigrid training details or other variables.
```python
# ...other backbones...
elif opt.model == 'x3d':
    model = x3d3.generate_model('M', n_classes=opt.classes)
# ...other backbones...
```
I just use the generate_model(x3d_version, **kwargs) interface to build the X3D model, and then I'd like to modify the backbone to test other training tricks.
So every epoch we have to run x3d.module.aggregate_sub_bn_stats(), otherwise the bn statistics would stay at their initial values? @kkahatapitiya Is there any other configuration like this?
You have to run aggregate_sub_bn_stats() before validation (i.e., every time you put the model in eval() mode).
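To make the ordering concrete, here is a minimal sketch of an epoch loop; num_epochs, train_loader, and val_loader are placeholders for your own setup, not names from the repo:

```python
import torch

for epoch in range(num_epochs):
    model.train()
    for clips, labels in train_loader:
        ...  # forward / loss / backward / optimizer step

    # Fold the per-split BN stats into the eval-mode BN layers *before*
    # switching to eval(); otherwise bn.running_mean/var keep their initial
    # values and eval predictions degenerate.
    model.module.aggregate_sub_bn_stats()  # drop .module if not using DataParallel

    model.eval()
    with torch.no_grad():
        for clips, labels in val_loader:
            ...  # validation forward pass
```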
Hi @kkahatapitiya, could you please tell me whether you have tested performance in the normal training setup (not in multigrid training mode)? I trained and tested with a constant batch size of 128 for 350 epochs on Kinetics-200 (a smaller, 200-class dataset, which should give higher accuracy), and got 64.0% accuracy, which is similar to ResNet18's performance on this dataset. (I ran aggregate_sub_bn_stats() each epoch without validation, for faster training.) Initial lr: 0.05; optimizer schedule: cosine decay.
Sorry about the long delay in responding. Since the data split and multiple training hyperparameters are different, I am not sure what the expected performance would look like. If you train with the given hyperparameters and the default K400 split, you'll get a number closer to what's reported.