Great difficulty reproducing training results
Has anyone trained EfficientNet-B0 from scratch on ImageNet and successfully reproduced the reported results? I used the model in this repo and tried to follow the official hyperparameters as closely as I could. I even implemented a modified RMSprop in PyTorch that matches its TensorFlow counterpart (there's a difference in the treatment of epsilon). I used standard preprocessing. My setup is 8 GPUs, each computing a batch of 32 images, and the learning rate is scaled accordingly (0.016).
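For reference, the epsilon difference between the two frameworks can be sketched like this. This is a plain-Python, single-parameter sketch of the documented update rules, not this repo's actual optimizer code; the `lr`/`rho`/`eps` values are illustrative:

```python
import math

# TensorFlow RMSprop: w -= lr * g / sqrt(ms + eps)    (eps inside the sqrt)
# PyTorch   RMSprop: w -= lr * g / (sqrt(ms) + eps)   (eps outside the sqrt)

def rmsprop_step_tf(w, g, ms, lr=0.016, rho=0.9, eps=1e-3):
    """One TF-style RMSprop step: eps is added before the square root."""
    ms = rho * ms + (1 - rho) * g * g
    return w - lr * g / math.sqrt(ms + eps), ms

def rmsprop_step_pt(w, g, ms, lr=0.016, rho=0.9, eps=1e-3):
    """One PyTorch-style RMSprop step: eps is added after the square root."""
    ms = rho * ms + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(ms) + eps), ms

# With small gradients the two placements give very different step sizes.
w_tf, _ = rmsprop_step_tf(1.0, 1e-4, 0.0)
w_pt, _ = rmsprop_step_pt(1.0, 1e-4, 0.0)
```

With a near-zero accumulator, the TF form divides by roughly `sqrt(eps)` while the PyTorch form divides by roughly `eps`, so the same hyperparameters can produce step sizes that differ by orders of magnitude.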
So far my best effort is still well below the reported numbers (more than 3 points lower in top-1 accuracy). The only difference I can think of is the exponential moving average (EMA) of model weights, which the official TensorFlow repo includes. But I highly doubt that EMA makes such a huge difference.
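For clarity, the EMA in question is an exponential moving average of the model weights, which is used for evaluation. A minimal plain-Python sketch over a dict of weights (the 0.9999 decay is an assumption; the official repo applies this to the full set of model variables):

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """Blend the running average toward the current weights."""
    for k in ema_weights:
        ema_weights[k] = decay * ema_weights[k] + (1 - decay) * weights[k]
    return ema_weights

# Toy run: the averaged weight slowly tracks the raw weight.
ema = {"w": 0.0}
for _ in range(10000):
    ema = ema_update(ema, {"w": 1.0})
# ema["w"] is about 0.632 here: 1 - 0.9999**10000
```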
Is there anything else in the model itself that may change the training dynamics?
Yes. In my own reproduction experiments the best top-1 accuracy barely crossed 0.70.
I have tried to copy every detail of the network and experiment setup in this repo. No idea how to improve the results.
My reproduce: https://github.com/lukemelas/EfficientNet-PyTorch/issues/81
Hi, have you solved this issue?
> My reproduce: #81
>
> Hi, have you solved this issue?

Not yet.
Read the released code. The author uses many tricks in training. I tried to replicate most of them and got 70.45% accuracy. Still far below 76%.
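One such trick is label smoothing; the EfficientNet paper reports a smoothing ratio of 0.1. A sketch of the idea for a single target (illustrative, not the repo's implementation):

```python
def smooth_labels(one_hot, smoothing=0.1):
    """Mix the one-hot target with a uniform distribution over classes."""
    n = len(one_hot)
    return [(1 - smoothing) * y + smoothing / n for y in one_hot]

# 1000-way ImageNet target with the true class at index 42.
target = [0.0] * 1000
target[42] = 1.0
smoothed = smooth_labels(target)
# True class gets 0.9001, every other class gets 0.0001; still sums to 1.
```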
Hi, can you share the training code for what you've tried so far? That could really help.
How do you use multiprocessing?
Hi @LinxiFan

I'm also trying to reproduce the results: https://github.com/kakaobrain/fast-autoaugment/tree/dev/efficientnet

I got the same result on EfficientNet-B0 after I fixed many things, including RMSprop and EMA. (Indeed, EMA had a great impact on training...)

But I got slightly worse results on B1-B4. Still trying to bridge the gap.
> Hi @LinxiFan
>
> I'm also trying to reproduce the results: https://github.com/kakaobrain/fast-autoaugment/tree/dev/efficientnet
>
> I got the same result on EfficientNet-B0 after I fixed many things, including RMSprop and EMA. (Indeed, EMA had a great impact on training...)
>
> But I got slightly worse results on B1-B4. Still trying to bridge the gap.
Hi @ildoonet

- Does EMA mean an exponential moving average LR schedule? Do you mean EMA is much better than a cosine LR schedule? My EfficientNet-B0 experiment improved from 75% to 76.9% after using your RMSprop instead of PyTorch SGD, with a cosine annealing LR schedule.
- Why did you initialize the mean square of the gradient as 1, not 0?

Thank you
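On the second question: to my knowledge, TF1's `RMSPropOptimizer` initializes its mean-square accumulator to ones, while PyTorch starts from zeros, which changes the size of the very first updates. A sketch with illustrative numbers only:

```python
import math

def first_step_size(ms0, g=0.01, lr=0.016, rho=0.9, eps=1e-3):
    """Size of the first RMSprop update for a given accumulator init."""
    ms = rho * ms0 + (1 - rho) * g * g
    return lr * g / math.sqrt(ms + eps)  # TF-style eps inside the sqrt

step_ones = first_step_size(1.0)   # denominator starts near 1 -> small step
step_zeros = first_step_size(0.0)  # denominator near sqrt(eps) -> big step
```

Starting from zeros, the first few updates can be much larger, which may matter early in training before the accumulator warms up.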
@tzm1003306213 could you tell exactly what configuration you used to reach 76.9? I set all parameters as defined in the paper and got only 73.57 without EMA and 75.37 with EMA.
> @tzm1003306213 could you tell exactly what configuration you used to reach 76.9? I set all parameters as defined in the paper and got only 73.57 without EMA and 75.37 with EMA.
@misadows I used all the hyperparameters given in the paper and got the same result as you with EMA: 76.9 with a cosine LR schedule.
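For concreteness, by cosine LR schedule I mean the usual half-cosine decay from the base LR down to zero over training (the 0.016 base LR matches the 8x32 batch setup discussed above; the step counts below are made up):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.016):
    """Half-cosine decay: base_lr at step 0, 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Starts at the full base LR, halves midway, and ends at zero.
lrs = [cosine_lr(s, 100) for s in (0, 50, 100)]
```

Note this differs from the paper's stated schedule (exponential decay), so it is a deviation from the official setup, not a reproduction of it.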
I use the torchvision implementation; it's just difficult to find a proper LR. So IMO this isn't a problem with this repo; rather, EfficientNet itself is badly overrated anyway. A model whose results rely on heavy tuning of training hyperparameters is a bad one. Besides, don't put too much trust in academic papers if you can't verify any of them. Most academic SOTA models rely heavily on big datasets like ImageNet. Once datasets become smaller, those models will find their home in the dustbin.