
BatchNorm after ReLU

Open ducha-aiki opened this issue 9 years ago • 8 comments

Hi,

I am running a somewhat similar benchmark, but on caffenet128 (and now moving to ResNets) on ImageNet. One thing I have found is that the best position for BN in a non-residual net is after ReLU and without the scale+bias layer (https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md):

| Name | Accuracy | LogLoss | Comments |
| --- | --- | --- | --- |
| Before | 0.474 | 2.35 | As in paper |
| Before + scale&bias layer | 0.478 | 2.33 | As in paper |
| After | 0.499 | 2.21 | |
| After + scale&bias layer | 0.493 | 2.24 | |

Maybe it is worth testing too.
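
For reference, here is a minimal Torch sketch of the "Before" and "After, without scale&bias" variants (the conv sizes are just placeholders, not the ones from my benchmark):

```lua
require 'nn'

-- "Before", as in the paper: conv -> BN (with learnable scale & bias) -> ReLU
local before = nn.Sequential()
before:add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
before:add(nn.SpatialBatchNormalization(64))       -- affine gamma/beta enabled by default
before:add(nn.ReLU(true))

-- "After", without scale & bias: conv -> ReLU -> BN with affine = false
local after = nn.Sequential()
after:add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
after:add(nn.ReLU(true))
after:add(nn.SpatialBatchNormalization(64, 1e-5, 0.1, false))  -- no learnable scale or bias
```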

Second, results on CIFAR-10 often contradict results on ImageNet. E.g., leaky ReLU > ReLU on CIFAR, but it is worse on ImageNet.

P.S. We could cooperate in ImageNet testing, if you agree.

ducha-aiki avatar Jan 31 '16 17:01 ducha-aiki

Oh, interesting! I'll add a link to this issue in the README, if you don't mind.

What is the 'scale&bias layer'? In Torch, batch normalization layers have learnable weight and bias parameters that correspond to β and γ in the Batch Norm paper. Is that what you mean?
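
Something like this, in Torch nn terms (a quick sketch, just to check we mean the same parameters):

```lua
require 'nn'

-- With affine = true (the default), the layer has per-feature-map learnable parameters:
-- bn.weight is the scale (gamma) and bn.bias is the shift (beta).
local bn = nn.SpatialBatchNormalization(64)
print(bn.weight:size(), bn.bias:size())

-- With affine = false there is no learnable scale or bias; both fields are nil.
local bnNoAffine = nn.SpatialBatchNormalization(64, 1e-5, 0.1, false)
print(bnNoAffine.weight, bnNoAffine.bias)
```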

gcr avatar Feb 01 '16 14:02 gcr

Yes, β and γ. In Caffe, BatchNorm is split into a batchnorm layer and a separate layer (Scale) for the learnable affine parameters.

ducha-aiki avatar Feb 01 '16 15:02 ducha-aiki

> On Imagenet, @ducha-aiki found the opposite effect from the CIFAR results above. Putting batch normalization after the residual layer seems to improve results on Imagenet.

That is not correct: so far I have done batchnorm experiments only on plain, non-residual nets :) The batchnorm ResNets are still training. And the "ThinResNet-101" from my benchmark does not use batchnorm at all, as a baseline.

ducha-aiki avatar Feb 01 '16 15:02 ducha-aiki

Oh I guess I misunderstood, pardon. So this experiment was on an ordinary Caffenet, not a residual network?

gcr avatar Feb 01 '16 15:02 gcr

Yes.

ducha-aiki avatar Feb 01 '16 15:02 ducha-aiki

Thanks, that makes sense. It's interesting because it challenges the commonly-held assumption that batch norm before ReLU is better than after. I'd be interested to see how much of an impact the residual network architecture has on ImageNet---the harder the task, the more of an effect different architectures seem to have.

gcr avatar Feb 01 '16 15:02 gcr

> commonly-held assumption that batch norm before ReLU is better than after.

I never understood this from the original paper, because the point of data whitening is to normalize a layer's input, and the ReLU output is usually the input to the next layer.

ducha-aiki avatar Feb 01 '16 16:02 ducha-aiki

@ducha-aiki The paper reads:

> In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian"; normalizing it is likely to produce activations with a stable distribution.

I get from this that it's better to batch-normalize the linear function, since it's more likely to behave like a normal distribution (from which the method is derived), especially for cases like the ReLU function, which is asymmetric.
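
Spelling the two orderings out in the BN paper's notation might make this clearer (E and Var are taken over the mini-batch; ε, γ, β as in the paper):

```latex
% BN before the nonlinearity, as in the paper: normalize the pre-activation x = Wu + b
\[ \hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}, \qquad
   y = \gamma \hat{x} + \beta, \qquad \text{output} = \max(0, y) \]

% BN after the nonlinearity (the "After" variant discussed above): normalize a = max(0, Wu + b)
\[ \hat{a} = \frac{a - \mathrm{E}[a]}{\sqrt{\mathrm{Var}[a] + \epsilon}} \]
```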

cgarciae avatar Jun 27 '17 21:06 cgarciae