
Any plans to add a `Swin` transformer?

Open klae01 opened this issue 3 years ago • 2 comments

Swin Transformer achieves higher accuracy than ViT at a similar model size and computational cost. I think that training it with CLIP's method and dataset would yield higher performance.

  • ViT-B/16, 384x384, 86M params, 55.4 GFLOPs, 77.9 (ImageNet-1K top-1 acc)
  • Swin-B, 384x384, 88M params, 47.0 GFLOPs, 84.5 (ImageNet-1K top-1 acc)

Reference: Swin Transformer paper, Table 1(a)
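To make the gap concrete, here is a minimal sketch that only does arithmetic on the two rows listed above. The dictionary layout and the `compare` helper are my own illustration (hypothetical names, not part of CLIP or of the Swin code), and the numbers are simply copied from the list.

```python
# Figures copied from the two rows above (Swin Transformer paper, Table 1(a)).
# The dict layout and the helper name are illustrative assumptions.
models = {
    "ViT-B/16": {"params_m": 86.0, "gflops": 55.4, "top1": 77.9},
    "Swin-B": {"params_m": 88.0, "gflops": 47.0, "top1": 84.5},
}


def compare(base, other):
    """Relative change in params and FLOPs (percent) and absolute top-1 gain (points)."""
    return {
        "params_pct": 100.0 * (other["params_m"] - base["params_m"]) / base["params_m"],
        "gflops_pct": 100.0 * (other["gflops"] - base["gflops"]) / base["gflops"],
        "top1_gain": other["top1"] - base["top1"],
    }


diff = compare(models["ViT-B/16"], models["Swin-B"])
print(
    f"params: {diff['params_pct']:+.1f}%, "
    f"FLOPs: {diff['gflops_pct']:+.1f}%, "
    f"top-1: {diff['top1_gain']:+.1f} points"
)
```

In other words, Swin-B has about 2% more parameters, uses about 15% fewer FLOPs, and is 6.6 points higher in ImageNet-1K top-1 accuracy at the same 384x384 resolution. This says nothing yet about how it would behave under CLIP-style training, which is what I would like to see compared.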

It would be great if a Swin transformer could be added to compare performance.

Although ResNet is still powerful, its results here were much weaker than ViT's, and I would like to check whether that really shows transformers overwhelmingly outperform convolutions in the vision field. Therefore, it would be nice to also add well-known convolution-based networks such as EfficientNet and ConvNeXt.

klae01, Aug 14 '22 09:08

This is exactly what I've been expecting, too.

celestialxevermore, Sep 28 '22 07:09

I've been expecting this, too.

MaAo, Jan 13 '23 16:01