                        BatchNorm after ReLU
Hi,
I am running a somewhat similar benchmark, but on caffenet128 (and now moving to ResNets) on ImageNet. One thing that I have found: the best position of BN in a non-residual net is after ReLU and without the scale+bias layer (https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md):
| Name | Accuracy | LogLoss | Comments | 
|---|---|---|---|
| Before | 0.474 | 2.35 | As in paper | 
| Before + scale&bias layer | 0.478 | 2.33 | As in paper | 
| After | 0.499 | 2.21 | |
| After + scale&bias layer | 0.493 | 2.24 | |
Maybe it is worth testing too.
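For the Torch side, here is a minimal sketch of the two orderings being compared (arbitrary layer sizes, not the actual benchmark nets):

```lua
-- Minimal sketch of the two orderings (arbitrary sizes, not the benchmark nets).
require 'nn'

-- "Before": Conv -> BN (with learnable scale & bias) -> ReLU, as in the paper.
local before = nn.Sequential()
before:add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
before:add(nn.SpatialBatchNormalization(64))       -- affine (gamma, beta) enabled by default
before:add(nn.ReLU(true))

-- "After": Conv -> ReLU -> BN without the scale & bias layer (affine = false).
local after = nn.Sequential()
after:add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
after:add(nn.ReLU(true))
after:add(nn.SpatialBatchNormalization(64, 1e-5, 0.1, false))  -- affine disabled
```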
Second, results on CIFAR-10 often contradict results on ImageNet. For example, leaky ReLU > ReLU on CIFAR, but it is worse on ImageNet.
P.S. We could cooperate on ImageNet testing, if you agree.
Oh, interesting! I'll add a link to this issue in the README, if you don't mind.
What is the 'scale&bias layer'? In Torch, batch normalization layers have learnable weight and bias parameters that correspond to β, γ in the Batch Norm paper. Is that what you mean?
Yes, β and γ. In Caffe, BatchNorm is split into a batch-norm layer and a separate layer with the learnable affine parameters.
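For reference, β and γ are the parameters of the affine step of the batch-norm transform in the paper, applied after the per-mini-batch normalization (μ_B and σ_B² are the mini-batch mean and variance):

```latex
\hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}},
\qquad
y = \gamma \hat{x} + \beta
```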
On ImageNet, @ducha-aiki found the opposite effect from the CIFAR results above. Putting batch normalization after the residual layer seems to improve results on ImageNet.
That is not correct: I have done batchnorm experiments only on plain, non-residual nets so far :) The batchnorm ResNets are in training. And the "ThinResNet-101" from my benchmark does not use batchnorm at all, as a baseline.
Oh, I guess I misunderstood, pardon. So this experiment was on an ordinary CaffeNet, not a residual network?
Yes.
Thanks, that makes sense. It's interesting because it challenges the commonly-held assumption that batch norm before ReLU is better than after. I'd be interested to see how much of an impact the residual network architecture has on ImageNet---the harder the task, the more of an effect different architectures seem to have.
> commonly-held assumption that batch norm before ReLU is better than after.
I never understood this from the original paper, because the point of data whitening is to normalize a layer's input, and the ReLU output is usually the input to the next layer.
@ducha-aiki The paper reads:
> In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian"; normalizing it is likely to produce activations with a stable distribution.
I get from this that it's better to batch-normalize the linear output Wu + b, since it's more likely to behave like a normal distribution (from which the method is derived), especially compared with a function like ReLU, whose output is asymmetric.
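A quick toy check of that intuition (a hypothetical sketch, not part of the benchmark above): push random activations through a linear layer and then a ReLU, and compare the two distributions.

```lua
-- Toy illustration: the pre-activation Wu + b of random inputs is roughly
-- symmetric around its mean, while the ReLU output is non-negative and skewed.
require 'nn'
torch.manualSeed(0)

local linear = nn.Linear(100, 100)
local x = torch.randn(1000, 100)        -- random "previous layer" activations
local pre = linear:forward(x)           -- Wu + b
local post = nn.ReLU():forward(pre)     -- ReLU(Wu + b)

print(('pre-activation: mean %.3f, std %.3f'):format(pre:mean(), pre:std()))
print(('post-ReLU:      mean %.3f, std %.3f'):format(post:mean(), post:std()))
```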