
Architecture discussions

Open snakers4 opened this issue 7 years ago • 11 comments

Thanks for this repo! I managed to obtain only ~40-45% top-1; it looks like you achieved ~69%.

Among the major architecture differences I noticed only ReLU6. Did it boost accuracy, or is it just inherited from MobileNet?
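For reference, ReLU6 (used in MobileNet) is just a ReLU clamped above at 6, which keeps activations bounded; a minimal comparison:

```python
import torch
import torch.nn as nn

# ReLU6 clamps activations to [0, 6]; plain ReLU only clamps below at 0.
x = torch.tensor([-2.0, 3.0, 8.0])
relu = nn.ReLU()
relu6 = nn.ReLU6()

print(relu(x))   # tensor([0., 3., 8.])
print(relu6(x))  # tensor([0., 3., 6.])
```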

Starting from lr 0.1, decayed by a factor of 0.5 every 20 epochs.
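That schedule (start at 0.1, multiply by 0.5 every 20 epochs) maps directly onto PyTorch's `StepLR`; a sketch, where the model is a placeholder standing in for the actual MnasNet:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # placeholder for the actual MnasNet model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# lr = 0.1 * 0.5 ** (epoch // 20)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(60):
    # ... train one epoch, call optimizer.step() per batch ...
    scheduler.step()

# After 60 epochs: 0.1 * 0.5**3 = 0.0125
print(optimizer.param_groups[0]["lr"])
```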

This also points me to OpenAI's AdamW, which is more or less a continuous version of your training regime. It would be interesting if you tried it; it also converges quite quickly.
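For reference, AdamW (Adam with decoupled weight decay) is now built into PyTorch as `torch.optim.AdamW`; a minimal sketch, with the model and hyperparameters as placeholder assumptions:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # placeholder model

# AdamW applies weight decay directly to the weights instead of
# folding it into the gradient, which decouples decay strength
# from the adaptive learning rate.
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```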

Batch size 256 on 2 K80 GPUs.

There is some evidence that for such models a batch size of 1000-2000 is preferable =(
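If you do go to batch sizes of 1000-2000, the common heuristic (the "linear scaling rule" from Goyal et al.'s large-batch ImageNet training) is to scale the learning rate proportionally with the batch size; a sketch, with the base values taken from the settings above:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    # Linear scaling rule: lr grows proportionally with batch size,
    # relative to a known-good (base_lr, base_batch) pair.
    return base_lr * batch_size / base_batch

print(scaled_lr(2048))  # 8x the batch -> 8x the lr: 0.8
print(scaled_lr(256))   # unchanged: 0.1
```

A warmup period over the first few epochs is usually paired with this rule to avoid early divergence at the larger lr.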

snakers4 avatar Dec 27 '18 02:12 snakers4

OpenAI AdamW

snakers4 avatar Dec 27 '18 02:12 snakers4

Hi snakers4, thx for your advice.

I have checked your repo before. As for why it is better: just using SGD already achieved 65% top-1. The reason might be ReLU. During my training of MnasNet I found that the representation power is a little weak, since the training loss stays higher than the testing loss, so I changed the dropout rate from the default 0.5 to 0.0, which indeed boosted the performance to 68%.
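For what it's worth, that dropout change is a one-line tweak in a typical classifier head; a sketch, with the layer sizes as placeholder assumptions:

```python
import torch.nn as nn

# Typical classifier head; changing p from 0.5 to 0.0 effectively
# disables dropout, so all features reach the final linear layer.
classifier = nn.Sequential(
    nn.Dropout(p=0.0),       # was p=0.5
    nn.Linear(1280, 1000),   # placeholder feature/class sizes
)
```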

I have also tried Adam and RMSprop, but they just cannot converge in my case.

billhhh avatar Dec 27 '18 03:12 billhhh

if I just use SGD, it already achieved 65% top1

You mean just using my model with SGD or your model?

so I change the dropout rate from default 0.5 to 0.0, which indeed boosted the performance to 68%

Interesting, afaik we did not use any dropout at all

I have also tried adam and rmsprop, but they just cannot converge in my case.

Interesting. Well, anyway, just give adamw and a larger batch a try =)

snakers4 avatar Dec 27 '18 03:12 snakers4

Also, @Randl trained MobileNetV2 with both Adam and SGD: Adam converged 3x faster, but SGD ended up only ~1 pp better ...

All of this tells me that the newer networks are getting more and more fragile ...

snakers4 avatar Dec 27 '18 03:12 snakers4

My model + SGD

billhhh avatar Dec 27 '18 03:12 billhhh

Agree with you, newer nets need to be carefully tuned. I still don't know how the paper gets 74%. Maybe a large batch size matters, but currently I don't have that much compute to try it.

billhhh avatar Dec 27 '18 03:12 billhhh

We will see what @Randl will comment, he has more GPUs now afaik

snakers4 avatar Dec 27 '18 03:12 snakers4

@billhhh I use this code, but the loss does not change. Could you help me solve it?

huxianer avatar Dec 27 '18 06:12 huxianer

I've managed to achieve 72+% top-1, however, I also managed to fuck up checkpointing, thus there is no checkpoint (yet).

Randl avatar Dec 29 '18 11:12 Randl

@Randl Wow, that's a pretty good result! Did you use 224 input? How about the other settings? The same as mine, or different?

billhhh avatar Jan 07 '19 04:01 billhhh

anyway, just give adamw and a larger batch a try =)

Have you solved the problem? I trained the network, but the loss did not drop.

xi-mao avatar Dec 08 '21 06:12 xi-mao