pytorch-deep-learning

Interpretation of patches for ViT

Open nick-konovalchuk opened this issue 1 year ago • 0 comments

IMO, the "default" first steps for an image going through ViT-B/16 are (sketched in code after this list):

  1. Creating patches with torch.nn.Unfold
  2. Doing linear projection with torch.nn.Linear
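For reference, a minimal sketch of those two steps (shapes assume a 224x224 RGB input and ViT-B/16's embedding dimension of 768):

```python
import torch
import torch.nn as nn

patch_size = 16
in_channels = 3
embed_dim = 768

image = torch.randn(1, in_channels, 224, 224)   # (batch, channels, height, width)

# 1. Create patches: Unfold extracts non-overlapping 16x16 blocks
unfold = nn.Unfold(kernel_size=patch_size, stride=patch_size)
patches = unfold(image)                          # (1, 3*16*16, 196) = (1, 768, 196)
patches = patches.transpose(1, 2)                # (1, 196, 768) -> one row per patch

# 2. Linear projection of each flattened patch into the embedding space
projection = nn.Linear(in_channels * patch_size * patch_size, embed_dim)
patch_embeddings = projection(patches)           # (1, 196, 768)
```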

The notebook implements a somewhat scaled-down hybrid approach, which is NOT equivalent. If you check the appendix of the ViT paper, their hybrids are ResNet X + ViT Y, meaning they take the output feature maps of a ResNet as the input to a ViT.
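As a rough sketch of what that hybrid setup means (the tiny conv stack below is only a stand-in for a ResNet backbone, to show the shape flow, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

embed_dim = 768

# Stand-in for a ResNet: any CNN that produces a spatial feature map
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

image = torch.randn(1, 3, 224, 224)
feature_map = backbone(image)                    # (1, 256, 56, 56) CNN feature maps

# Each spatial location of the feature map becomes a "patch" token
projection = nn.Conv2d(256, embed_dim, kernel_size=1)
tokens = projection(feature_map)                 # (1, 768, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)       # (1, 3136, 768) -> ViT input sequence
```

The key difference is that the ViT no longer sees raw pixel patches at all; it sees CNN features, which is why the hybrid is not the same thing as the default Unfold + Linear patch embedding.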

nick-konovalchuk · Nov 15 '23