pytorch-image-models
[FEATURE] Feature extraction for SWIN Transformer
There are several applications from the creators of Swin Transformer in object detection and semantic segmentation, but the implementation differs a bit from the original Swin for image classification (BasicLayer has additional operations before the main part). Are you planning to add this feature-extraction part to your version?
Mentioned in #607: yes, the plan is to add feature extraction, but in a way that's generic for all non-CNN archs (the various vision transformers and the new MLP-Mixer nets). I have other things to do first and haven't quite figured out the interface w.r.t. my existing feature helpers for CNNs.
Hey @rwightman – once you have a good idea of the interface I'm happy to help with this – I'd like to use it for my experimentation.
One approach for e.g. ViT/DeiT/Swin would be to change the way the blocks work so that they take and return non-flattened input (shape B, C, H, W), and flatten/unflatten internally (to B, C, H*W). This would let the forward method of features_only models work without change, and in particular mean they could be used as-is for U-Nets/segmentation.
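A minimal sketch of that idea: a hypothetical wrapper module (`SpatialBlockWrapper` is an illustrative name, not timm API) that lets a sequence-based transformer block accept and return NCHW tensors by flattening/unflattening internally. Note that standard ViT blocks operate on a (B, N, C) token layout, so the flatten includes a transpose:

```python
import torch
import torch.nn as nn

class SpatialBlockWrapper(nn.Module):
    """Illustrative only: adapts a (B, N, C) transformer block to NCHW I/O."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> flatten spatial dims to tokens: (B, H*W, C)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)
        x = self.block(x)
        # tokens back to spatial layout: (B, C, H, W)
        x = x.transpose(1, 2).reshape(B, C, H, W)
        return x

# Round-trip check with a stand-in block (identity keeps values intact)
wrapped = SpatialBlockWrapper(nn.Identity())
out = wrapped(torch.randn(2, 96, 14, 14))
print(out.shape)  # torch.Size([2, 96, 14, 14])
```

With a wrapper like this, downstream code that expects CNN-style NCHW feature maps could consume transformer stages unchanged.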
is there any update?
https://github.com/open-mmlab/mmdetection/blob/master/mmdet/models/backbones/swin.py#L746
@rwightman It seems difficult to implement an FPN for plain ViT using the same criteria as CNNs, but it is achievable for hierarchical models such as Swin (4 stages) and CSWin (4 stages). Could we implement these stage-based models first? We hope to test the performance of some ViT models on downstream tasks through a single framework (e.g. timm).
Supported on the main branch now with NHWC output (see #1438 for more).