vit-pytorch

Has anyone tried to train this code on ImageNet from scratch?

Open • mingkai-zheng opened this issue 4 years ago • 1 comment
Thanks for the amazing work!

I followed the hyperparameters described in the original paper: Adam optimizer, batch size = 4096, lr = 3e-3, weight_decay = 0.3, dropout = 0.1. It seems that the regularization is too strong, though, and the model does not converge well.

mingkai-zheng, Dec 29 '20 12:12

@KyleZheng1997 Hi Kyle, there is one gotcha when training attention networks with Adam: the LayerNorm parameters should be excluded from weight decay. Or you can just turn off weight decay entirely; it usually makes only a minor difference.
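
For anyone landing here later, a minimal sketch of the parameter-group approach in plain PyTorch. The helper name `split_weight_decay_params` and the toy model are hypothetical stand-ins, not part of vit-pytorch; the `lr` and `weight_decay` values are the ones from the original post. `torch.optim.AdamW` accepts the same parameter groups if decoupled weight decay is preferred.

```python
import torch
from torch import nn


def split_weight_decay_params(model: nn.Module):
    """Return (decay, no_decay) parameter lists.

    Parameters owned by nn.LayerNorm modules go into no_decay;
    everything else goes into decay. (Hypothetical helper.)
    """
    decay, no_decay = [], []
    seen = set()  # guards against tied/shared parameters being added twice
    for module in model.modules():
        # recurse=False: only the parameters owned directly by this module
        for param in module.parameters(recurse=False):
            if not param.requires_grad or id(param) in seen:
                continue
            seen.add(id(param))
            if isinstance(module, nn.LayerNorm):
                no_decay.append(param)
            else:
                decay.append(param)
    return decay, no_decay


# Toy stand-in for a transformer block (assumption, not the actual ViT model)
model = nn.Sequential(
    nn.LayerNorm(16),
    nn.Linear(16, 32),
    nn.GELU(),
    nn.Linear(32, 16),
)

decay, no_decay = split_weight_decay_params(model)

optimizer = torch.optim.Adam(
    [
        {"params": decay, "weight_decay": 0.3},     # value from the original post
        {"params": no_decay, "weight_decay": 0.0},  # LayerNorm weight and bias
    ],
    lr=3e-3,  # value from the original post
)
```

Some training setups also exclude bias terms from weight decay; that would be an extra condition in the helper and is not something suggested in this thread.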

Also try a smaller learning rate, perhaps starting at 3e-4 and working your way down with a scheduler.
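
A rough sketch of what "starting at 3e-4 and working down" could look like. The choice of cosine annealing, the step count, and the toy model are all assumptions for illustration; the thread does not say which scheduler to use.

```python
import torch
from torch import nn

model = nn.Linear(8, 8)  # toy stand-in for the real model (assumption)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

total_steps = 100  # hypothetical; a real run would use its full step count
# Cosine annealing is one common choice, swapped in here for illustration
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

lrs = []
for step in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])  # lr used for this step

    # Dummy forward/backward pass on random data
    loss = model(torch.randn(4, 8)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()

    optimizer.step()
    scheduler.step()  # decay the learning rate after each optimizer step
```

Large-batch transformer training often adds a linear warmup phase before the decay; that is omitted here to keep the sketch short.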

lucidrains, Dec 29 '20 17:12