Phil Wang
@EelcoHoogendoorn oh got it! yeah, I think there are a couple of ways to do depthwise conv, fixing this!
@EelcoHoogendoorn ok, let me know if https://github.com/lucidrains/vit-pytorch/releases/tag/0.16.13 looks good to you
@Tato14 Hi Joan! It seems the approach came from https://arxiv.org/pdf/2005.00928.pdf. I'll have to read it after I get through my queue of papers this week to see how difficult it...
@Tato14 the naive attention map for individual layers is this variable `attn` https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L56
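For anyone who wants to poke at it without tracing the library source, here is a minimal pure-Python sketch of what that `attn` variable holds: the softmax-normalized scaled dot-product scores, one row per query token. The helper names and the toy numbers are made up for illustration; this is not the library's code.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_map(q, k, dim_head):
    # attn[i][j] = softmax over j of (q_i . k_j / sqrt(dim_head))
    scale = dim_head ** -0.5
    scores = [[scale * sum(qi * kj for qi, kj in zip(qrow, krow)) for krow in k]
              for qrow in q]
    return [softmax(row) for row in scores]

# toy 3-token sequence with dim_head = 2
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attn = attention_map(q, k, dim_head = 2)
# each row of attn is a probability distribution over the 3 tokens
```

Each row sums to 1, so visualizing a row shows how much one token attends to every other token in that layer.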
@abeyang00 Hi Abe! So actually ViT already supports images of different sizes, as long as the height and width are divisible by the patch size, and both height and width...
@abeyang00 the image size there is just the max image size along one dimension - on forward, it still calculates the number of tokens and positionally embeds it correctly https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py#L107
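A rough sketch of that bookkeeping, just to make the idea concrete (the helper below is hypothetical, not the library's actual code): the positional embedding is allocated for the maximum image size, and on forward only the first `n + 1` entries are used (the extra one being the CLS token).

```python
def num_patch_tokens(height, width, patch_size):
    # ViT splits the image into non-overlapping patches; both dims
    # must divide evenly by the patch size
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)

# embedding table sized for the max image size (256 x 256, patch 32)
max_tokens = num_patch_tokens(256, 256, 32)   # 8 * 8 = 64 patches
pos_emb = list(range(max_tokens + 1))         # stand-in for the learned embedding

# a smaller 256 x 128 input just uses fewer entries
n = num_patch_tokens(256, 128, 32)            # 8 * 4 = 32 patches
used = pos_emb[: n + 1]                       # CLS token + n patch tokens
```

So the `image_size` argument only caps the embedding table; any divisible height/width within it works at forward time.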
@hussam789 sounds good! in 0.7.6 you can do

```python
import torch
from vit_pytorch.t2t import T2TViT
from performer_pytorch import Performer

performer = Performer(
    dim = 512,
    depth = 2,
    heads =...
```
You just need the image size to be max(height, width)
https://github.com/lucidrains/TimeSformer-pytorch
Ross' framework (timm) should allow you to use ViT and download the pretrained models with one command