ResNet
Why can't we reproduce the same curves, especially on ResNet-34 and ResNet-101?
Thank you for your ResNet implementation with MXNet. It is a good example, especially for beginners.
I am still following this project and the training steps for ImageNet, and we want to reproduce your results.
However, except for ResNet-18, we cannot get the same curves for ResNet-34 and ResNet-101. This is shown below:
We followed the instructions step by step and used the same input image size and learning rate schedule; only the batch size and GPUs differ. For ResNet-34 the batch size is 256 on 2 M40 GPUs; for ResNet-101 the batch size is 96 on 4 M40 GPUs. We have also run experiments on CIFAR-10 with ResNet-164 at different batch sizes (16, 64 and 128); the performance is almost the same, and even better with the smaller batch sizes. So we do not know why we cannot get the same curves. Please tell us in more detail how to train ResNet-34, or even deeper ResNets, from scratch. Thank you very much.
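For reference, our training is set up roughly like the minimal MXNet sketch below (not the repo's exact script; the symbol and .rec file names and the hyperparameter values are placeholders showing only where the batch size and GPU context enter):

```python
import mxnet as mx

# Hypothetical placeholders: any ResNet symbol / ImageNet .rec file will do.
sym = mx.sym.load('resnet-34-symbol.json')
batch_size = 256                              # total batch, split across the GPUs below
devs = [mx.gpu(0), mx.gpu(1)]                 # e.g. 2 M40s; use more devices for 4 GPUs

train_iter = mx.io.ImageRecordIter(
    path_imgrec='imagenet_train.rec', data_shape=(3, 224, 224),
    batch_size=batch_size, rand_crop=True, rand_mirror=True, shuffle=True)

mod = mx.mod.Module(symbol=sym, context=devs)
mod.fit(train_iter, num_epoch=100, optimizer='sgd',
        optimizer_params={'learning_rate': 0.1, 'momentum': 0.9, 'wd': 1e-4})
```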
@zhoubinxyz Hi, different GPUs have no influence on the training curve, but the batch size matters a lot, especially for ImageNet, which is much bigger than CIFAR-10. I suggest using a batch size in [256, 512].
Also, all the training logs and details are in the log directory, where you may find more information to check your training. Finally, do not use rec files created with quality=80. Thanks.
@tornadomeet Thanks for your prompt reply.
We are now training ResNet-34 with batch size 512 to see if we can get the same result. We are using quality=90 according to your instructions. Do you think we need to set quality=100?
No need for quality=100; you'd better use 90 or 95 (which Caffe uses, and the default value in MXNet has been changed from 80 to 95). Also, I'm not sure whether the newest MXNet will influence the training; you may try the MXNet version given in the log description.
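For clarity, the quality setting is the JPEG encoding quality applied when the images are packed into the .rec file (if I remember correctly, tools/im2rec.py exposes it as a --quality option). A minimal sketch with MXNet's recordio API, with placeholder file names, showing where the value goes:

```python
import mxnet as mx
import cv2

# Hypothetical file names; the point is the JPEG `quality` argument used
# when packing images into a .rec file (90-95 rather than 80).
record = mx.recordio.MXRecordIO('imagenet_train.rec', 'w')
img = cv2.imread('some_image.jpg')                      # BGR numpy array
header = mx.recordio.IRHeader(flag=0, label=0, id=0, id2=0)
packed = mx.recordio.pack_img(header, img, quality=95, img_fmt='.jpg')
record.write(packed)
record.close()
```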
@tornadomeet Thanks for your prompt reply.
OK, we will use quality=90. We have just started training ResNet-34 with batch size 512 on 4 M40 GPUs. Let us see whether the error curve matches yours; this will also give a clue as to whether a different MXNet version influences the training. Thanks!
@bruinxiong any news?
@austingg Based on the suggestion from @tornadomeet, we retrained ResNet-34 with batch size 512 (256 was used the first time) and obtained similar results. We also tried ResNet-50 with the same batch size and learning rate schedule and again obtained similar results. For ResNet-101, because we do not have 8 GPUs, we cannot reach the results that @tornadomeet did. So we are curious why different batch sizes cause such a large difference in performance. It cannot simply be "the larger the better"; different datasets behave differently, since on CIFAR-10 a smaller batch size even gives better results. So we have a question: is there any theoretical relationship between batch size and the scale of the dataset?

In this figure, xxxx-2 is our second experiment with the suggested batch size.
@bruinxiong Thanks. I am also doing some experiments on ResNet-18 with batch size 256; 256 is worse than 512. And there is no theoretical relationship between batch size and dataset size; currently it is still a hyperparameter :sob:
The difference between larger and smaller batch sizes shows up at epoch >= 95, so please train for more epochs.
Actually, there is some theory about batch size; it relates to the gradient variance. Ref paper: Coupling Adaptive Batch Sizes with Learning Rates.
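Roughly, the standard argument (not specific to that paper, and assuming i.i.d. samples) is that averaging per-example gradients over a mini-batch of size B scales the gradient variance as:

```latex
% g_i = per-example gradient, \sigma^2 = its variance, B = batch size.
\mathrm{Var}\!\left[\frac{1}{B}\sum_{i=1}^{B} g_i\right] = \frac{\sigma^2}{B}
```

So a larger batch reduces the gradient noise, which is why batch size and learning rate interact rather than the batch size acting alone.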
@bruinxiong, I have trained a ResNet-18 using this code with batch size 256; after 4 days of training, the result is a little better (30.4% error rate) than with batch size 512. You may try ResNet-34 with batch size 256, but be patient: a bigger batch size converges faster at the beginning.
@austingg Thank you for sharing. If you use batch size 256, do you change the learning rate schedule? Based on suggestions from many researchers (some from universities, some from industry), for batch size the larger the better, especially on large datasets (though this is not strict and needs to be verified); maybe more randomness is not always better. We are training ResNet-101, and due to limited GPU resources we use only 4 M40 GPUs and batch size 225 rather than 480, with a learning rate schedule similar to that of ResNet-50. The curves are shown below.
The purple curve is our third training run.
@bruinxiong The lr decreases at epochs 30, 60 and 90, and at epoch 98 the aggressive data augmentation is removed.
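For concreteness, a step schedule like that (multiply the lr by 0.1 at epochs 30, 60 and 90) can be expressed with MXNet's MultiFactorScheduler; this is only a sketch, assuming ImageNet-1k with batch size 256, and it does not cover the augmentation change at epoch 98, which belongs to the data pipeline rather than the scheduler:

```python
import mxnet as mx

batch_size = 256
epoch_size = 1281167 // batch_size        # ImageNet-1k training images per epoch (assumed)
# Multiply the learning rate by 0.1 at epochs 30, 60 and 90 (converted to iterations).
scheduler = mx.lr_scheduler.MultiFactorScheduler(
    step=[30 * epoch_size, 60 * epoch_size, 90 * epoch_size], factor=0.1)

# Passed to the optimizer, e.g.:
optimizer_params = {'learning_rate': 0.1, 'momentum': 0.9, 'wd': 1e-4,
                    'lr_scheduler': scheduler}
```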
Hi austingg, do you know why the aggressive data augmentation is removed in the last few epochs? From the curve it seems to improve the result, but why do this?