large_vlm_distillation_ood
ViT Performance
Hello, I tried out the code; it is very neat and works fine. One thing I noticed is that the ViT-B performance (Table 9) is much worse than the ResNet-18 performance. Any idea why that is? Thanks!
ViT is known to overfit significantly on small training datasets, even under strong augmentation, because it lacks the inductive priors built into CNNs, such as translation equivariance. For ViT to match or beat CNNs, the training set needs to be at least on the scale of tiered-ImageNet or ImageNet.
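For a concrete sense of the capacity mismatch that drives this overfitting, here is a minimal sketch using the timm library (an assumption for illustration; not necessarily what this repo uses) that compares parameter counts of the two backbones:

```python
# Minimal sketch (assumes the timm package is installed): compare the
# parameter counts of ViT-B/16 and ResNet-18. ViT-B has roughly 8x the
# parameters and none of the convolutional inductive biases, so with a
# small dataset it memorizes rather than generalizes.
import timm

for name in ["vit_base_patch16_224", "resnet18"]:
    # num_classes=100 is a hypothetical head size for illustration
    model = timm.create_model(name, pretrained=False, num_classes=100)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")

# Approximate output:
# vit_base_patch16_224: ~86M parameters
# resnet18: ~11M parameters
```

With that many parameters and no translation-equivariance prior, ViT-B needs far more data (or heavy pretraining) before it stops overfitting.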