pytorch-deep-learning
Interpretation of patches for ViT
Imo the "default" first steps for an image going through ViT-B/16 are:
- creating patches with `torch.nn.Unfold`
- doing the linear projection with `torch.nn.Linear`
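A minimal sketch of those two steps (shapes and the 768 embedding dim follow ViT-B/16; the variable names are my own):

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768

# Step 1: cut the image into non-overlapping 16x16 patches.
unfold = nn.Unfold(kernel_size=patch_size, stride=patch_size)
# Step 2: linearly project each flattened patch to the embedding dim.
proj = nn.Linear(3 * patch_size * patch_size, embed_dim)

x = torch.randn(1, 3, 224, 224)      # (B, C, H, W)
patches = unfold(x)                  # (B, C*p*p, N) = (1, 768, 196)
patches = patches.transpose(1, 2)    # (B, N, C*p*p) = (1, 196, 768)
tokens = proj(patches)               # (B, N, embed_dim) = (1, 196, 768)
```

For a 224x224 image this yields N = (224/16)^2 = 196 patch tokens.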
The notebook implements a somewhat scaled-down hybrid approach, which is NOT equivalent. If you check the appendix of the paper, their hybrids are ResNet X + ViT Y: they take the output feature maps of a ResNet and feed them as input to a ViT.