Any plan to add `Swin` transformer?
Swin Transformer achieves higher accuracy than ViT at a similar model size and computational cost.
I think that training it with CLIP's method and dataset would show even higher performance.
- ViT-B/16, 384x384 input, 86M params, 55.4 GFLOPs, 77.9% ImageNet-1K top-1 acc
- Swin-B, 384x384 input, 88M params, 47.0 GFLOPs, 84.5% ImageNet-1K top-1 acc

Reference: Swin Transformer paper, Table 1(a)
It would be great if a Swin transformer could be added to compare performance.
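For reference, here is a minimal sketch of how a Swin backbone might be wired up as a CLIP-style image tower, assuming the `timm` library is available. The class name, the chosen `timm` model variant, the projection layer, and `embed_dim` are all illustrative assumptions on my part, not anything from the CLIP codebase:

```python
import torch
import torch.nn as nn
import timm


class SwinImageEncoder(nn.Module):
    """Hypothetical CLIP image tower built on a timm Swin backbone."""

    def __init__(self, embed_dim: int = 512,
                 backbone: str = "swin_base_patch4_window12_384"):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        # pretrained=False here because CLIP trains its towers from scratch
        # on its own dataset.
        self.trunk = timm.create_model(backbone, pretrained=False, num_classes=0)
        # Project backbone features into the joint image-text embedding space.
        self.proj = nn.Linear(self.trunk.num_features, embed_dim, bias=False)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        features = self.trunk(images)   # (batch, num_features)
        return self.proj(features)      # (batch, embed_dim)


if __name__ == "__main__":
    encoder = SwinImageEncoder()
    dummy = torch.randn(1, 3, 384, 384)  # Swin-B/384 expects 384x384 inputs
    print(encoder(dummy).shape)          # torch.Size([1, 512])
```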
Although ResNet is still powerful, it performed much worse than ViT in CLIP's results, and I would like to check whether transformers really do show overwhelming performance compared to convolutions in the vision field. Therefore, it would be nice to also add well-known convolution-based networks such as EfficientNet and ConvNeXt (see the sketch below for one way to line them up).
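If such backbones were added, one quick way to line them up by size is to enumerate them through `timm`. This is only a sketch; the model names below are my guesses at roughly comparable `timm` variants and may differ between `timm` versions:

```python
import timm

# Candidate image backbones to compare (names are assumptions about which
# timm variants roughly match the paper configurations).
BACKBONES = [
    "vit_base_patch16_384",
    "swin_base_patch4_window12_384",
    "convnext_base",
    "efficientnet_b5",
]

for name in BACKBONES:
    model = timm.create_model(name, pretrained=False, num_classes=0)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```

FLOP counts could be added on top of this with a tool such as fvcore's FlopCountAnalysis, but parameter counts alone already make the ViT/Swin/ConvNeXt/EfficientNet comparison concrete.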
It is exactly what I've expected, too.
I've been expecting this, too.