Swin-Transformer
CLS token
Thank you for the great paper and code repo, super nice idea.
You mention in the paper that you experimented with appending a CLS token and using it to perform classification. I was wondering how you treat this CLS token: does it attend to all patches, or just the patches that fall into its local window (in the Swin self-attention process)? I also cannot find where this is implemented in the code; a pointer would be helpful.
Many thanks, Harry
I also see checks in place which, as far as I can tell, would prevent a CLS token from simply being prepended, e.g. in the forward method of the SwinTransformerBlock:
https://github.com/microsoft/Swin-Transformer/blob/ad1c947e76791d8623b61d178c715f737748ade8/models/swin_transformer.py#L251
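For context, the check at that line requires the token count to equal H * W, so window partitioning would fail if a CLS token were naively prepended. A minimal sketch of the issue (the function name and shapes here are illustrative, not the repo's actual code):

```python
import torch

# Illustrative stand-in for the assertion in SwinTransformerBlock.forward:
# window partitioning assumes the sequence length L is exactly H * W.
def passes_shape_check(x: torch.Tensor, H: int, W: int) -> bool:
    B, L, C = x.shape
    return L == H * W  # prepending a CLS token makes L = H * W + 1

H, W, C = 7, 7, 96
patches = torch.zeros(1, H * W, C)          # plain patch tokens: passes
cls_token = torch.zeros(1, 1, C)
with_cls = torch.cat([cls_token, patches], dim=1)  # CLS prepended: fails

print(passes_shape_check(patches, H, W))    # True
print(passes_shape_check(with_cls, H, W))   # False
```

So a CLS token could not attend inside the windowed attention without special handling; it would have to be split off before window partitioning and merged back afterwards.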
What the authors are describing is applying a global average pooling layer to the feature map output of the final stage, followed by a linear classifier for image classification. This strategy yields accuracy comparable to ViT's [cls] token.
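That pooled classification head can be sketched as follows; this is a simplified, self-contained version rather than the repo's exact code, and the dimensions (768 channels, 1000 classes) are just illustrative defaults:

```python
import torch
import torch.nn as nn

class GapHead(nn.Module):
    """Global-average-pool the final-stage tokens, then classify linearly."""

    def __init__(self, dim: int = 768, num_classes: int = 1000):
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool1d(1)   # pool over the token dimension
        self.head = nn.Linear(dim, num_classes)  # linear classifier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, C) token features from the final stage
        x = self.avgpool(x.transpose(1, 2)).flatten(1)  # (B, C)
        return self.head(x)                             # (B, num_classes)

features = torch.randn(2, 49, 768)  # e.g. a 7x7 final-stage feature map
logits = GapHead()(features)
print(logits.shape)  # (2, 1000)
```

The pooled vector plays the same role a [cls] token plays in ViT: a single global summary of the image fed to the classifier.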
Thanks for the response; however, what you describe is the standard approach for Swin. The paper explicitly says they also tried prepending a CLS token.