vit-pytorch
Has anyone tried to train this code on ImageNet from scratch?
Thanks for the amazing work!!
I followed the hyperparameters described in the original paper (Adam optimizer, batch size = 4096, lr = 3e-3, weight decay = 0.3, dropout = 0.1), but the regularization seems too strong and the model does not converge well.
@KyleZheng1997 Hi Kyle, there is one gotcha when training attention networks with Adam: you should exclude the LayerNorm parameters from weight decay. Alternatively, you can just turn off weight decay entirely; it usually makes only a minor difference.
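A minimal sketch of that parameter grouping (not the repo's own training code; the ViT configuration follows this repo's README, and the decay value just mirrors the paper's setting):

```python
import torch
from vit_pytorch import ViT

model = ViT(
    image_size = 224,
    patch_size = 16,
    num_classes = 1000,
    dim = 768,
    depth = 12,
    heads = 12,
    mlp_dim = 3072,
    dropout = 0.1,
    emb_dropout = 0.1,
)

# split parameters so LayerNorm weights and biases get no weight decay
decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # LayerNorm weights and biases are 1-D tensors; exclude them from decay
    if param.ndim <= 1 or name.endswith('.bias'):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.Adam([
    {'params': decay, 'weight_decay': 0.3},
    {'params': no_decay, 'weight_decay': 0.0},
], lr = 3e-4)
```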
Also try a smaller learning rate, perhaps starting at 3e-4 and working your way down with a scheduler.
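For the schedule, one common recipe is linear warmup followed by cosine decay. Here is a sketch using PyTorch's built-in schedulers; the warmup and total step counts are purely illustrative assumptions, not values from the paper:

```python
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# illustrative step counts -- tune for your dataset and batch size
warmup_steps = 10_000
total_steps = 300_000

scheduler = SequentialLR(
    optimizer,
    schedulers = [
        # ramp lr from 1% of the base value up to the full value
        LinearLR(optimizer, start_factor = 0.01, total_iters = warmup_steps),
        # then anneal down to near zero over the remaining steps
        CosineAnnealingLR(optimizer, T_max = total_steps - warmup_steps),
    ],
    milestones = [warmup_steps],
)

# call scheduler.step() once per optimizer step
```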