
Data

Open ibmua opened this issue 7 years ago • 25 comments

Convolving over whitened, or really any heavily preprocessed data (if the original data is not also fed in), seemed like a bad idea to me, so I tried using the standard CIFAR-100 dataset simply scaled to [0..1]. And while I'm getting results that are even worse than the ones you've published on your preprocessed data, quite possibly because of some differences in LR decrease functions and the like, I'm seeing results better than yours on the plain dataset. For example, I'm seeing below 30% average for the 40|1 case, for which you've published almost 31% and for which I'm getting around 32% on your data.

Have you even tried training on the standard data? I'm personally seeing around a 2-point difference on 40|1 on the CIFAR-100 test set. Considering that most of your results were within a 2-point range and that's about the difference between 10m and 50m parameters in your nets, it makes a hell of a difference.

On my own architectures your preprocessing was even more destructive, at least in many cases. Also, I've looked a bit for any mention of preprocessing in the first paper (the preact paper) you linked to when you tried to rationalize why you did it, and couldn't find any mention of whitening or any such preprocessing, neither in it nor in https://arxiv.org/pdf/1512.03385v1.pdf, which they linked to when they said they use data like in this paper.

Also, in http://cs231n.github.io/neural-networks-2/#datapre whitening of images is discouraged, though they don't talk about that particular type of whitening.
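
For reference, by "simply scaled to [0..1]" I mean nothing fancier than something like this (a rough sketch - the dataset path is just an example, and I'm assuming the raw .t7 stores pixels as 0..255):

-- Rough sketch: scale raw pixel values from [0..255] down to [0..1].
-- The path and the 0..255 assumption are examples, not the repo's exact files.
cifar100 = torch.load("./wide-residual-networks/datasets/cifar100.t7")

cifar100.trainData.data = cifar100.trainData.data:float():div(255)
cifar100.testData.data  = cifar100.testData.data:float():div(255)

torch.save("./wide-residual-networks/datasets/cifar100_01.t7", cifar100)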

ibmua avatar Sep 21 '16 14:09 ibmua

@ibmua whitening doesn't seem to give different results from mean/std normalization on CIFAR for networks with batch normalization

szagoruyko avatar Sep 23 '16 11:09 szagoruyko

Well, it gives very different results compared to just scaling to [0..1], that's what I'll tell you. I'm training 28-12 on the non-distorted dataset at the moment and the training hasn't finished yet, but I'm already seeing below 19% error. And that makes sense. I've previously trained a slightly modified grouped structure - kind of like WRN-40-12, but with 80m parameters. It had 18.3% error on CIFAR-100. From the looks of it, with your data it would take something like 300m parameters to achieve that kind of score.

Try it. If you don't see a difference, I'll send you my code and data.

A quick teaser

{ epoch : 144 test_acc : 81.1 loss : 0.0051724427212507 train_acc : 99.997996794872 lr : 0.0024408188379711 train_time : 256.00531888008 test_time : 17.025609016418 n_parameters : 52587060 train_loss : 0.0051724427212507 }

{ optnet_optimize : true num_classes : 100 generate_graph : false learningRate : 0.0024408188379711 init_value : 10 randomcrop : 4 sequences : 1 epoch_step : 1 randomcrop_type : "reflection" learningRateDecayRatio : 0.992 model : "load" save : "logs/load_7942550" dampening : 0 weightDecay : 0.0005 shortcutType : "A" nesterov : true cudnn_deterministic : false depth : 28 nGPU : 1 multiply_input_factor : 1 dataset : "./datasets/cifar100_combined.t7" max_epoch : 1000 momentum : 0.9 optimMethod : "sgd" widen_factor : 12 hflip : true imageSize : 32 dropout : 0 learningRateDecay : 0.0001 data_type : "torch.CudaTensor" batchSize : 128 }

Don't look at the epoch number, as it wasn't trained continuously - I had to pause/resume it because I needed to switch OS.

Edit: that 18.9 actually turned out to be the lowest error from the whole training.

ibmua avatar Sep 23 '16 11:09 ibmua

@ibmua interesting, thanks for sharing! is this new state-of-the-art on CIFAR-100?

szagoruyko avatar Sep 24 '16 08:09 szagoruyko

18.3% error on CIFAR-100 (for 80m params)? I haven't seen any lower numbers published anywhere. Yes, the general WRN architecture seems good and tweakable. Though, from what I've seen, there might be a bunch of architectures that perform similarly with a similar number of params. For example, I've got 19.5% for 24m params with another, non-residual architecture of my own, which seems comparable. At least while we don't yet have the right numbers for WRN.

It's very good that you've published your results with the number of params. Another thing you might want to add is the number of multiply-add operations. IMHO these things are missing from 99% of papers, including those by the InceptionNet authors. You can nearly always just pump up any old model with more parameters and FLOPs - probably making some slight changes to avoid getting too many parameters in the last layers - and you're done, you've got the best model. But the best architecture is the one with the highest efficiency.
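
In case it helps, counting multiply-adds per conv layer is trivial - something like this sketch (it ignores biases and assumes you already know the output spatial size):

-- Sketch: multiply-adds of a single 2D conv layer (ignores biases).
function conv_madds( nIn, nOut, kW, kH, outW, outH, groups )
    groups = groups or 1
    return nOut * (nIn / groups) * kW * kH * outW * outH
end

-- e.g. the first 3x3 conv of a WRN on a 32x32 CIFAR image:
print( conv_madds(3, 16, 3, 3, 32, 32) )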

If you'll be retraining and want to actually get as close as possible to the best result achievable at the moment, you might want to first take a look at https://arxiv.org/pdf/1605.06489v1.pdf for grouped convs, http://arxiv.org/pdf/1604.04112.pdf for ELU+ResNet as a proof of concept, and better yet http://arxiv.org/pdf/1605.09332v2.pdf https://github.com/ltrottier/pelu.resnet.torch for PELU. And try max pooling instead of striding, because with fast Winograd kernels strided convs make no sense anymore. For the groups you'd have to modify utils.lua to use cudnn directly, because it's the only thing in Torch with groups, though it's still relatively poor. But you know that already =). I'll upload my code, so you don't have to do that yourself. You might also want to try using mini-batches of 64.
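
To illustrate what I mean, here's a sketch of a grouped block with max-pool downsampling (not the exact blocks from my repo - the channel counts are made up):

require 'nn'
require 'cudnn'

-- Sketch of a grouped 3x3 conv: cudnn.SpatialConvolution takes the number of
-- groups as its last argument (nn's own module has no groups support).
local block = nn.Sequential()
block:add( cudnn.SpatialConvolution(64, 128, 3, 3, 1, 1, 1, 1, 4) ) -- groups = 4
block:add( nn.SpatialBatchNormalization(128) )
block:add( nn.ReLU(true) )

-- Downsample with max pooling instead of a strided conv.
block:add( nn.SpatialMaxPooling(2, 2, 2, 2) )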

You already did a lot of the work anyway, testing different block types. Also, it's not like you have to run each test a whole 5 times. I wonder what variance/std you had in your results during those tests, though - pretty interesting info to add to your article; you only reported the mean (which is a common thing to do, of course). And I guess you can extrapolate insights from the numbers you already got onto the newer net.

If you'll want to go with another net altogether, you might want to try keeping the stem not too wide while making the 3x3 convs fat. I think that might prevent some overfitting.

ibmua avatar Sep 24 '16 19:09 ibmua

Here you go: https://github.com/ibmua/Breaking-Cifar I've uploaded the data as well, but you can also try downloading it from the original source, just so we can be sure that nothing's wrong. You can check my scripts to see how I've modified utils.lua, and possibly some other stuff, to use the CuDNN module directly without the intermediate nn module in order to use groups. I also modified train.lua to autosave models during training and to use a different LR decrease function.

ibmua avatar Sep 24 '16 23:09 ibmua

Sorry for misclicks =X

I first thought you'd built on top of https://github.com/facebook/fb.resnet.torch , but now I see that their code is pretty different. I wonder what you built this code from?

ibmua avatar Sep 25 '16 00:09 ibmua

Got 5.8% error with 40|1 on CIFAR-10, for which you've published 6.85%. That was with data whose mean and std were adjusted somewhat inexactly. Going to try running it without any such adjustments as well, for comparison.

ibmua avatar Sep 25 '16 06:09 ibmua

Okay, so it looks as though adjusting the mean and std (per dataset, taking the stats from the training set) actually helps. So the numbers should be even better than the ones I mentioned at the beginning.

ibmua avatar Sep 25 '16 19:09 ibmua

Ran another test and got 5.9% with 40|1 on that inexactly mean- and std-adjusted data. And I'm getting 6.1-6.3% on [0..1] data.

ibmua avatar Sep 26 '16 02:09 ibmua

I wonder how to make your code work in the cudaHalf space.

ibmua avatar Sep 26 '16 04:09 ibmua

5.3% with 82-1 (1.2m params), 4.9% with 160-1 (2.4m params) on that test.

ibmua avatar Sep 26 '16 14:09 ibmua

@ibmua thanks. I did a few tests with [0,1] scaling and no mean/std normalization; it indeed improves results, although I couldn't match your numbers exactly. For fp16, see the https://github.com/szagoruyko/wide-residual-networks/tree/fp16 branch.
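
Roughly, the idea is casting the model and the inputs to torch.CudaHalfTensor, something like the sketch below (assuming cutorch/cunn were built with half-precision support - the branch itself may handle casting and accumulation differently):

require 'cunn'

-- Rough idea only: build (or load) a net, cast it and the inputs to half.
local net = nn.Sequential()
net:add( nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1) )
net:add( nn.ReLU(true) )
net = net:type('torch.CudaHalfTensor')

local input = torch.randn(8, 3, 32, 32):type('torch.CudaHalfTensor')
print( #net:forward(input) )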

szagoruyko avatar Sep 26 '16 16:09 szagoruyko

Well, the numbers may depend on the init and hyperparams, like LR scheduling, which is quite different in my setup. And possibly also on data preprocessing. I will be testing to see whether the per-channel adjustments used in fb.resnet.torch are actually better than the per-dataset kind I tried. It may be due to init, or even something else, but I just got slightly worse results in 2 different tests, which I found pretty weird, considering that the adjustments that seemed to perform better were inexact -- I just subtracted and divided the whole datasets by some pretty rough numbers, not the exact mean and std.

I wonder what results exactly you got? =)

Also, note that adjusting mean and std definitely provides benefit over my previous [0..1].

Here's my script for per-channel preprocessing, kind of like in fb.resnet.torch. You can process that [0..1] dataset you have with this thing.

cifar10 = torch.load("./wide-residual-networks/datasets/cifar10.t7")

function prepare( d , result )
    m = torch.mean( torch.reshape(d , (#d)[1], 3,32*32), 3)
    s = torch.std ( torch.reshape(d , (#d)[1], 3,32*32), 3)

    m = torch.reshape( torch.mean( m, 1) , 3 )
    s = torch.reshape( torch.mean( s, 1) , 3 )
    print(#m)
    print(m)
    print(#s)
    print(s)

    for i=1, (#result)[1] do
        result[i][1]:csub( m[1] )
        result[i][2]:csub( m[2] )
        result[i][3]:csub( m[3] )

        result[i][1]:div ( s[1] )
        result[i][2]:div ( s[2] )
        result[i][3]:div ( s[3] )
        end
    end

prepare( cifar10.trainData.data , cifar10.testData.data  )
prepare( cifar10.trainData.data , cifar10.trainData.data )

torch.save("./wide-residual-networks/datasets/cifar10_std.t7", cifar10)

ibmua avatar Sep 26 '16 17:09 ibmua

After rerunning the tests you might want to contact the many paper authors who cited you with those results, including http://arxiv.org/pdf/1608.06993v1.pdf

Interestingly, I guess you might get even more paper mentions with those poor results as more people will publish them for comparison to make a point that their archs are better. ☜(ˆ▿ˆc)

ibmua avatar Sep 26 '16 20:09 ibmua

Also tested with adjusting the std but not the mean, and it gave about the same result as [0..1]. IMHO, the fact that adjusting the mean actually helps, rather than hurts, amazes me. Considering that we don't use biases and that we are using ReLU, it seemingly should have made the net more prone to errors caused by differences in lighting, in my opinion. There's probably something terribly wrong with either the way we design nets or with CIFAR-10, which I've tested this on. Poor NN design seems like the more likely case to me.

ibmua avatar Sep 26 '16 21:09 ibmua

I have an idea as to how that could probably be fixed. Somebody already did something like that in the past: http://arxiv.org/pdf/1606.02228v2.pdf -> image preprocessing mini network. Gonna try. Activation structures are my interest: https://github.com/ibmua/testing-activations
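
To be concrete, by "preprocessing mini network" I mean something roughly like this sketch (the 1x1 conv choice and sizes are just my guess at the simplest version):

require 'nn'

-- Sketch: a tiny learnable "preprocessing" block prepended to the main net.
-- A 1x1 conv over the 3 input channels can learn a per-channel affine
-- remapping (and channel mixing) of the raw pixels.
local prep = nn.Sequential()
prep:add( nn.SpatialConvolution(3, 3, 1, 1) )
prep:add( nn.SpatialBatchNormalization(3) )

local model = nn.Sequential()
model:add( prep )
-- model:add( wrn_body )  -- prepend this to the usual WRN body here
print( model )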

ibmua avatar Sep 26 '16 21:09 ibmua

I have just got 5.6% with WRN-40-1 through simply scaling to [-1..1], LOL. This is better than what I've had with any other variant of this thing.

But it does not seem reliable in terms of variance of results depending on the init.
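
For completeness, that scaling is nothing more than this (a sketch, assuming the dataset is already stored in [0..1]; paths are just examples):

cifar10 = torch.load("./wide-residual-networks/datasets/cifar10.t7")

-- Map data already scaled to [0..1] into [-1..1]: x -> 2*x - 1.
cifar10.trainData.data:mul(2):csub(1)
cifar10.testData.data:mul(2):csub(1)

torch.save("./wide-residual-networks/datasets/cifar10_pm1.t7", cifar10)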

ibmua avatar Sep 27 '16 01:09 ibmua

Lol, so I've tried using a ResNet to learn the best preprocessing, and it basically decided that leaving the data as it is - maybe just tweaking the mean - is the best way to go. Lol. I don't believe that, so I'm going to keep trying =D Edit: using a lower LR seems to have worked.

ibmua avatar Sep 27 '16 04:09 ibmua

@ibmua I've got 19.1% with WRN-28-10-dropout on [0,1] data and 18.5% with WRN-40-10-dropout, should update the paper indeed.

szagoruyko avatar Sep 27 '16 14:09 szagoruyko

You may also want to try out that mean and std preprocessing I've posted, or this simpler version:


cifar10 = torch.load("/path/cifar.t7")

function prepare( d , res )
    m = torch.mean( d )
    s = torch.std ( d )

    print(m)
    print(s)

    res:csub( m )
    res:div ( s )

    m = torch.mean( res )
    s = torch.std ( res )

    print(m)
    print(s)
    print('')
    end

prepare( cifar10.trainData.data , cifar10.testData.data  )
prepare( cifar10.trainData.data , cifar10.trainData.data )

torch.save("/path/cifar100_general_mean_std.t7", cifar10)

Regarding my results, btw, I never used Dropout to obtain them. But yeah, it's probably beneficial.

ibmua avatar Sep 27 '16 17:09 ibmua

Hi @szagoruyko, could you please let me know exactly how you do the mean/std processing? @ibmua's code here https://github.com/ibmua/Breaking-Cifar computes a different mean and std than the fb.resnet.torch code provided here https://github.com/facebook/fb.resnet.torch/blob/e8fb31378fd8dc188836cf1a7c62b609eb4fd50a/datasets/cifar10.lua. That's because his code computes the mean and std first across spatial locations and then across the data.

zizhaozhang avatar Sep 30 '16 04:09 zizhaozhang

Yes, the code in both of those places is per-channel. You can also find it above in this thread. I've tried benchmarking the different approaches and haven't noticed any distinctive difference. It would take a hell of a lot of benchmarks to establish any statistically significant difference, I think. As for the theoretical view, I'm a bit surprised that this actually gets better results than not adjusting the mean at all. But.. it's some mathematical SGD-related thing, I guess, in the spirit of BatchNorm. Or maybe something else, but I don't yet understand what.
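
For clarity, the fb.resnet.torch-style computation is roughly this (a sketch, assuming an N x 3 x 32 x 32 float tensor; the path is just an example):

cifar10 = torch.load("./wide-residual-networks/datasets/cifar10.t7")

-- Sketch: per-channel mean/std over all images and all pixels at once
-- (the fb.resnet.torch way), instead of per-image stats averaged afterwards.
function channel_stats( d )
    local flat = torch.reshape( d, (#d)[1], 3, 32*32 )
    local m, s = {}, {}
    for c = 1, 3 do
        m[c] = flat:select(2, c):mean()
        s[c] = flat:select(2, c):std()
    end
    return m, s
end

local mean, std = channel_stats( cifar10.trainData.data )
print( mean )
print( std )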

Use any of these.

ibmua avatar Sep 30 '16 06:09 ibmua

I see. I tested your code. The error rate decreased. Thanks @ibmua

zizhaozhang avatar Sep 30 '16 14:09 zizhaozhang

@szagoruyko , when you publish an update, please include a figure with points plotted by accuracy vs. the number of parameters (and computation time in another figure). That would help. Also, maybe, a figure of depth vs. width with accuracy shown as the color saturation of the dots, or something like that.

Another interesting thing to see would be tests with only one of the ResNet groups widened, the last one being the primary target. But that would take some additional testing time, of course.

ibmua avatar Sep 30 '16 18:09 ibmua

@szagoruyko You might also want to try disabling padding, or at least setting it to zero-padding instead of reflect-padding, as that will make it easier to compare your models with those of others. I wonder whether zero-padding might even perform better than reflect-padding?

ibmua avatar Oct 01 '16 12:10 ibmua