wide-residual-networks
Wrong conclusions
Just took another look at https://arxiv.org/pdf/1605.07146v1.pdf
To summarize (the paper's conclusions):
- widening consistently improves performance across residual networks of different depth;
- increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed.
Actually, the first conclusion contradicts the second in a way, but that's not the point.
You may really want to try running tests on MNIST with absolutely no augmentation, as a task on which it is actually easy to overfit. The lesson I learnt from MNIST is that there seems to be an optimal width & depth, which is actually rather low for such a task. Also, the standard block/activation scheduling (the standard "preact" ordering) may not always be optimal, and groups (as in https://arxiv.org/pdf/1605.06489v1.pdf ) are hugely beneficial, at least up to a certain number of them (see the sketch below).
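To make concrete what I mean by groups, here is a minimal sketch of a pre-activation residual block whose 3x3 convolutions are grouped. This is written in PyTorch purely for illustration (the repo itself is Torch/Lua), and the names `GroupedPreactBlock`, `width`, and `groups` are my own, not from the repo or the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedPreactBlock(nn.Module):
    """Pre-activation residual block with grouped 3x3 convolutions (illustrative)."""
    def __init__(self, width, groups):
        super().__init__()
        # BN -> ReLU -> conv ("preact") ordering, as in the standard wide-resnet block
        self.bn1 = nn.BatchNorm2d(width)
        self.conv1 = nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out  # identity shortcut; width is unchanged inside the block
```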
I was able to achieve a 0.25% peak error rate pretty easily, and my best architecture reached the same peak while also holding a 0.26% error rate across many epochs, which was rather hard to get here: at this level of precision the across-epoch fluctuations are relatively large. This was without any parameter smoothing, such as a moving average of the weights.
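For reference, by parameter smoothing I mean something like an exponential moving average of the weights that is used at evaluation time instead of the raw weights. A minimal sketch, again in PyTorch with illustrative names (`EMA`, `decay`):

```python
import torch

class EMA:
    """Keep an exponential moving average of a model's parameters (illustrative)."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        # detached copies of the parameters to average into
        self.shadow = {name: p.detach().clone() for name, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for name, p in model.named_parameters():
            self.shadow[name].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)

    @torch.no_grad()
    def copy_to(self, model):
        # load the averaged weights, e.g. before evaluation
        for name, p in model.named_parameters():
            p.copy_(self.shadow[name])
```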
CIFAR performs pretty poorly with that architecture, though.