ResNeXt.pytorch
Questions about the performances.
Hi,
May I ask what your final performance is? The curves are a little confusing. I also implemented a different version (https://github.com/D-X-Y/ResNeXt); my results are a little lower than the official code, about 0.2 for CIFAR-10 and 1.0 for CIFAR-100. I really want to know what causes the differences.
I also tried training resnet20, 32, 44, and 56. I'm pretty sure the model architecture is the same as in the official code, but I still obtain a much lower accuracy.
Would you mind giving me some suggestions?
I am also curious about the training performance. BTW, I need to run the training many times with different hyper-parameters, and running 300 epochs takes days even with four Titan X GPUs. Did you try using fewer epochs or a different learning rate schedule? Please let me know if you have any suggestions. Thank you.
@D-X-Y On CIFAR-10 it reaches 96.44%, and on CIFAR-100 81.62%. However, I am not fixing the random seed for each run, so it sometimes does better than the baseline and sometimes worse.
As for what could be causing the performance difference, I talked with the author of the original paper, and he told me (he was right) that since I was using batch_size = 128 instead of 256, the lr should be divided by two. I have checked your code and I don't see much difference from mine, so could it just be a matter of finding the right random seed? Is the weight initialization exactly the same as in their code?
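For clarity, a small sketch of the linear scaling rule he was referring to (the helper function is mine and only restates the numbers in this thread, it is not code from the repo):

```python
# Scale the initial learning rate linearly with the batch size.
# Reference point from the discussion: lr = 0.1 at batch_size = 256.
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    return base_lr * batch_size / base_batch_size

print(scaled_lr(128))  # 0.05 -> lr divided by two for batch_size = 128
print(scaled_lr(64))   # 0.025
```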
@wangdelp Using a single TITAN X, it takes me roughly one day on CIFAR. What are your batch size and learning rate?
@prlz77 Thanks for your responses. The initialization is the same, and I only trained on CIFAR-10 once, so maybe the average performance would be better.
There are two versions of the ResNeXt paper; they changed the batch size for CIFAR from 256 to 128 in version 2. I notice that your performance on CIFAR-100 is about 1 point lower than the original paper. Do you think this is caused by the learning rate and multi-GPU training?
@D-X-Y Since the performance on CIFAR-10 is correct, it is difficult to guess what is happening on CIFAR-100. Some possibilities are:
- Running it many times with different random seeds might show there is no difference.
- The cuDNN configuration: I don't know if it is the same for the Torch and PyTorch implementations.
- As you said, multi-gpu and learning rate could also be an issue.
- I have checked line by line, but there could still be a difference between the original implementation and mine. However, I don't know if that would explain the gap between the two CIFARs.
Btw, take into account that the results I am providing are for the small net (cardinality 8, widen factor 4)! So it gets 0.1 better on CIFAR-10 and 0.6 worse on CIFAR-100. When I have some time, I will provide multi-run results to see if it is always like this.
@prlz77 I was using batch size 64, since I want to reduce memory consumption, distributed among 4 GPUs. I am using the default learning rate of 0.1 with decay at [0.5, 0.75] * args.epochs, running for 300 epochs. It sounds like I need two days to complete training on CIFAR-100. Maybe it's because other lab members are also using the GPUs.
Using batch size 256 would lead to out-of-memory errors on a 12GB GPU. Maybe I should try batch size 128 on two GPUs.
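For reference, a rough sketch of the schedule I described, using MultiStepLR as an illustration (the momentum, weight decay, and decay factor below are assumptions, not values taken from this thread):

```python
import torch
from torch import nn, optim

# Stand-in parameter; in practice the optimizer gets the ResNeXt model's parameters.
params = [nn.Parameter(torch.zeros(1))]

epochs = 300
optimizer = optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
# Decay at [0.5, 0.75] * epochs, i.e. epochs 150 and 225.
scheduler = optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.5 * epochs), int(0.75 * epochs)], gamma=0.1)

for epoch in range(epochs):
    # train_one_epoch(...)  # placeholder for the actual training loop
    scheduler.step()
```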
@wangdelp In my experience, bs=128 distributed over two 1080 Tis takes about one day; bs=128 on a single GPU takes a little longer, and bs=64 takes almost double the time for the same 300 epochs. I would suggest using bs=128 (note that with ngpu=4 each GPU processes 128/4 samples, which is a small amount of memory). If the GPUs are already in use, that could be causing a performance hit, as you say. There is still room for improvement, though: check that data loading is not the bottleneck, for instance by increasing the number of prefetching threads.
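As an illustration of the prefetching point (the data path, worker count, and plain ToTensor transform here are assumptions, not the repo's defaults):

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Global batch of 128; with DataParallel over 4 GPUs each one sees 128/4 = 32.
train_set = torchvision.datasets.CIFAR100(
    root='./data', train=True, download=True, transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True,
    num_workers=8,      # more prefetching threads so the GPUs are not starved
    pin_memory=True)

# model = torch.nn.DataParallel(model)  # splits each batch across the GPUs
```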
@prlz77 Thank you. Should I use an initial lr of 0.05 when batch size = 128, and 0.025 when batch size = 64?
@wangdelp Exactly!
Hi, guys. I have a question about the results reported in the paper. Did they report the median of the best test error during training, or the median of the test error after training? @prlz77 @wangdelp
@Queequeg92 I think it is the median of the best test error during training.
@prlz77 I agree with you, since models are likely to be overfitting at the end of the training process. I have sent emails to some of the authors to confirm.
@prlz77 I think Part D of this paper gives the answer.
@D-X-Y @prlz77 I'm facing the same problem when reproducing the performance of DenseNet-40 on CIFAR-100. With exactly the same configuration, the accuracy of the PyTorch version is often 1 point lower than the Torch version. I don't think it is caused by random seeds. However, after digging into the implementation details of the two frameworks, I find no differences. I am so confused...
In the past I've noticed up to a 1% difference just from using the cudnn fastest options, due to the noise introduced by numerical imprecision.
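Roughly what I mean by the two modes (just a sketch, nothing repo-specific; the seed is arbitrary and only there to make runs comparable):

```python
import torch
import torch.backends.cudnn as cudnn

torch.manual_seed(0)  # arbitrary seed, only so that runs can be compared

# "fastest" mode: autotuned, possibly non-deterministic algorithms
cudnn.benchmark = True
cudnn.deterministic = False

# fully reproducible mode (usually somewhat slower):
# cudnn.benchmark = False
# cudnn.deterministic = True
```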
@prlz77 I set cudnn.benchmark = True and cudnn.deterministic = True. Is that ok?
@wandering007 Maybe with cudnn.deterministic = False you would get better results.
@prlz77 No improvements from my experiments. Thank you anyway.
@wandering007 I'm sorry to hear that. I observed this behaviour some years ago; maybe the library has changed, or noise is not that important for this model.
@wandering007 I'm also confused about the differences between the two CIFAR datasets. I got similar accuracy with Wide-DenseNet on CIFAR-10, but on CIFAR-100, with exactly the same model and training details, the accuracy is always about 1% lower than reported in the paper. Do you have any suggestions on that? BTW, I'm using TensorFlow.
@boluoweifenda I haven't trained it with TensorFlow. There are a lot of ways to improve performance if you don't care about a fair comparison, like using dropout, a better lr schedule, or better data augmentation. Personally, I find a 1% performance difference between two frameworks acceptable. BTW, using the same settings across different frameworks is not entirely fair in itself :-)
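For example, the usual CIFAR augmentation in torchvision looks roughly like this (the normalization statistics are the commonly quoted CIFAR-10 values, not numbers from this thread):

```python
import torchvision.transforms as transforms

# Standard pad-and-crop plus horizontal-flip augmentation for 32x32 CIFAR images.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # commonly used CIFAR-10 mean
                         (0.2470, 0.2435, 0.2616)),  # and std
])
```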
@wandering007 Thanks for your reply~ But the fair comparison is exactly what I care about. Maybe I need to dig deeper to find the differences between the frameworks. However, I got the same accuracy on CIFAR-10 using TensorFlow, so the accuracy drop on CIFAR-100 is quite strange.
(╯°Д°)╯︵┻━┻