vit-pytorch
Has anyone tried to train this code on ImageNet from scratch?
Thanks for the amazing work!!
I followed the hyperparameters described in the original paper (Adam optimizer, batch size = 4096, lr = 3e-3, weight decay = 0.3, dropout = 0.1), but the regularization seems too strong and the model does not converge well.
@KyleZheng1997 Hi Kyle, there is one gotcha when training attention networks with Adam: you should exclude the LayerNorm parameters from weight decay. Alternatively, you can just turn off weight decay entirely; it usually makes only a minor difference.
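A minimal sketch of that parameter grouping (not the repo's own training code; the ViT configuration follows this repo's README, and the decay value just mirrors the paper's setting):

```python
import torch
from vit_pytorch import ViT

model = ViT(
    image_size = 224,
    patch_size = 16,
    num_classes = 1000,
    dim = 768,
    depth = 12,
    heads = 12,
    mlp_dim = 3072,
    dropout = 0.1,
    emb_dropout = 0.1,
)

# split parameters so LayerNorm weights and biases get no weight decay
decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # LayerNorm weights and biases are 1-D tensors; exclude them from decay
    if param.ndim <= 1 or name.endswith('.bias'):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.Adam([
    {'params': decay, 'weight_decay': 0.3},
    {'params': no_decay, 'weight_decay': 0.0},
], lr = 3e-4)
```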
Also try a smaller learning rate, perhaps starting at 3e-4 and working your way down with a scheduler.
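For the schedule, one common recipe is linear warmup followed by cosine decay. Here is a sketch using PyTorch's built-in schedulers; the warmup and total step counts are purely illustrative assumptions, not values from the paper:

```python
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# illustrative step counts -- tune for your dataset and batch size
warmup_steps = 10_000
total_steps = 300_000

scheduler = SequentialLR(
    optimizer,
    schedulers = [
        # ramp lr from 1% of the base value up to the full value
        LinearLR(optimizer, start_factor = 0.01, total_iters = warmup_steps),
        # then anneal down to near zero over the remaining steps
        CosineAnnealingLR(optimizer, T_max = total_steps - warmup_steps),
    ],
    milestones = [warmup_steps],
)

# call scheduler.step() once per optimizer step
```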