Phil Wang
Hi James, attention excels in the regime of big data, as shown in the paper. However, I am curious why fine-tuning did not work. Are you using Ross' model?...
@JamesQFreeman I think fine-tuning from a pretrained model should generally work well. Maybe you should raise the issue with him.
@JamesQFreeman ohh... well, I think I spot the error: your learning rate of `1e-2` is way too high. Try Karpathy's favorite LR, `3e-4`
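for reference, roughly what I mean, a minimal sketch assuming plain Adam on a vit-pytorch `ViT` (the config values and the fake batch are just placeholders):

```python
import torch
from vit_pytorch import ViT

model = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 2,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048
)

# 3e-4 is a far saner starting point than 1e-2 for transformers
optimizer = torch.optim.Adam(model.parameters(), lr = 3e-4)

images = torch.randn(4, 3, 256, 256)   # stand-in batch
labels = torch.randint(0, 2, (4,))

loss = torch.nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```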
@Erichen911 1200 is not enough! Off by 3 orders of magnitude at least!
@Erichen911 I would recommend getting a huge amount of images, preferably a million at least, and then doing self-supervised learning with BYOL before training on your tiny training set, otherwise...
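the pretraining step looks roughly like this with byol-pytorch (a sketch following the README there; swap in your own encoder and your own loader over the unlabelled images):

```python
import torch
from torchvision import models
from byol_pytorch import BYOL

encoder = models.resnet50(pretrained = True)  # any vision backbone works

learner = BYOL(
    encoder,
    image_size = 256,
    hidden_layer = 'avgpool'
)

optimizer = torch.optim.Adam(learner.parameters(), lr = 3e-4)

def sample_unlabelled_images():
    # placeholder - replace with a dataloader over your ~1M unlabelled images
    return torch.randn(8, 3, 256, 256)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = learner(images)          # BYOL self-supervised loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    learner.update_moving_average() # update the target encoder's EMA weights

# then fine-tune `encoder` on your small labelled set
torch.save(encoder.state_dict(), './pretrained-encoder.pt')
```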
@liberbey Hey Ahmet! One of the pitfalls with transformers is using settings that make the dimension per head too small. The dimension per head should be at least...
@liberbey depth should be a minimum of 6
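to make that concrete, a sketch of a config along those lines in vit-pytorch (the `dim_head = 64` here is just my assumption of a typical per-head dimension, not a hard rule):

```python
from vit_pytorch import ViT

# dim / heads (or an explicit dim_head) sets the dimension per head;
# keeping it from getting too small matters more than stacking on heads
v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 512,
    depth = 6,        # a minimum of 6 layers
    heads = 8,
    dim_head = 64,    # assumed typical per-head dimension
    mlp_dim = 2048
)
```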
@liberbey I think the only option is to get a bunch of unlabelled images (in the millions) and do self-supervised learning with BYOL before fine-tuning on your dataset. Transformers only...
@SuX97 @liberbey well, there's been a new development, you two should try https://github.com/lucidrains/vit-pytorch#distillation
you'll both still need at least a million images... haha
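the distillation setup at that link looks roughly like this (values here are only illustrative; check the vit-pytorch README for the exact API):

```python
import torch
from torchvision.models import resnet50
from vit_pytorch.distill import DistillableViT, DistillWrapper

teacher = resnet50(pretrained = True)  # a strong convnet as the teacher

student = DistillableViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048
)

distiller = DistillWrapper(
    student = student,
    teacher = teacher,
    temperature = 3,   # illustrative values
    alpha = 0.5
)

img = torch.randn(2, 3, 256, 256)
labels = torch.randint(0, 1000, (2,))

loss = distiller(img, labels)  # combines the label loss with the teacher distillation loss
loss.backward()

# after training, the student is used like a regular ViT
preds = student(img)  # (2, 1000)
```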