                        BatchNorm after ReLU
Hi,
I am running a somewhat similar benchmark, but on caffenet128 (and now moving to ResNets) on ImageNet. One thing that I have found: the best position of BN in a non-residual net is after ReLU and without the scale+bias layer (https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md):
| Name | Accuracy | LogLoss | Comments | 
|---|---|---|---|
| Before | 0.474 | 2.35 | As in paper | 
| Before + scale&bias layer | 0.478 | 2.33 | As in paper | 
| After | 0.499 | 2.21 | |
| After + scale&bias layer | 0.493 | 2.24 | |
Maybe it is worth testing too.
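For the Torch side, here is a minimal sketch of the two orderings being compared (arbitrary layer sizes, not the actual benchmark nets):

```lua
-- Minimal sketch of the two orderings (arbitrary sizes, not the benchmark nets).
require 'nn'

-- "Before": Conv -> BN (with learnable scale & bias) -> ReLU, as in the paper.
local before = nn.Sequential()
before:add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
before:add(nn.SpatialBatchNormalization(64))       -- affine (gamma, beta) enabled by default
before:add(nn.ReLU(true))

-- "After": Conv -> ReLU -> BN without the scale & bias layer (affine = false).
local after = nn.Sequential()
after:add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
after:add(nn.ReLU(true))
after:add(nn.SpatialBatchNormalization(64, 1e-5, 0.1, false))  -- affine disabled
```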
Second, results on CIFAR-10 often contradict results on ImageNet. For example, leaky ReLU > ReLU on CIFAR, but it is worse on ImageNet.
P.S. We could cooperate on ImageNet testing, if you agree.
Oh, interesting! I'll add a link to this issue in the README, if you don't mind.
What is the 'scale&bias layer'? In Torch, batch normalization layers have learnable weight and bias parameters that correspond to β, γ in the Batch Norm paper. Is that what you mean?
Yes, β and γ. In Caffe, BatchNorm is split into a batch-norm layer and a separate layer with the learnable affine parameters.
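For reference, β and γ are the parameters of the affine step of the batch-norm transform in the paper, applied after the per-mini-batch normalization (μ_B and σ_B² are the mini-batch mean and variance):

```latex
\hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}},
\qquad
y = \gamma \hat{x} + \beta
```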
On ImageNet, @ducha-aiki found the opposite effect from the CIFAR results above. Putting batch normalization after the residual layer seems to improve results on ImageNet.
That is not correct: I have done batchnorm experiments only on plain, non-residual nets so far :) The batchnorm ResNets are in training. And the "ThinResNet-101" from my benchmark does not use batchnorm at all, as a baseline.
Oh, I guess I misunderstood, pardon. So this experiment was on an ordinary CaffeNet, not a residual network?
Yes.
Thanks, that makes sense. It's interesting because it challenges the commonly-held assumption that batch norm before ReLU is better than after. I'd be interested to see how much of an impact the residual network architecture has on ImageNet---the harder the task, the more of an effect different architectures seem to have.
> commonly-held assumption that batch norm before ReLU is better than after.
I never understood this from the original paper, because the point of data whitening is to normalize a layer's input, and the ReLU output is usually the input to the next layer.
@ducha-aiki The paper reads:
> In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian"; normalizing it is likely to produce activations with a stable distribution.
I get from this that it's better to batch-normalize the linear output Wu + b, since it's more likely to behave like a normal distribution (from which the method is derived), especially compared with a function like ReLU, whose output is asymmetric.
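A quick toy check of that intuition (a hypothetical sketch, not part of the benchmark above): push random activations through a linear layer and then a ReLU, and compare the two distributions.

```lua
-- Toy illustration: the pre-activation Wu + b of random inputs is roughly
-- symmetric around its mean, while the ReLU output is non-negative and skewed.
require 'nn'
torch.manualSeed(0)

local linear = nn.Linear(100, 100)
local x = torch.randn(1000, 100)        -- random "previous layer" activations
local pre = linear:forward(x)           -- Wu + b
local post = nn.ReLU():forward(pre)     -- ReLU(Wu + b)

print(('pre-activation: mean %.3f, std %.3f'):format(pre:mean(), pre:std()))
print(('post-ReLU:      mean %.3f, std %.3f'):format(post:mean(), post:std()))
```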