
CLS token

Open harrygcoppock opened this issue 2 years ago • 3 comments

Thank you for the great paper and code repo, super nice idea.

You mention in the paper that you experiment with appending a CLS token and using this to perform classification. I was wondering how you treat this CLS token - does it attend to all patches, or just the patches which fall into its local window (in the Swin self-attention process)? I also cannot find where this is implemented in the code; a pointer would be helpful.

Many thanks, Harry

harrygcoppock avatar Dec 20 '22 16:12 harrygcoppock

I also see checks in place which, in my eyes, would prevent a CLS token from simply being prepended, e.g. in the forward method of the SwinTransformerBlock:

https://github.com/microsoft/Swin-Transformer/blob/ad1c947e76791d8623b61d178c715f737748ade8/models/swin_transformer.py#L251
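For anyone else looking at this, here is a minimal sketch (toy shapes, not the repo's actual code) of why a check like the one at that line blocks a naively prepended CLS token: window attention needs the token sequence to reshape back to an H x W grid, so L must equal H * W.

```python
import torch

# Toy dimensions for a Swin-T-like first stage (assumed, for illustration only).
B, H, W, C = 2, 56, 56, 96

patches = torch.randn(B, H * W, C)                        # patch tokens only
with_cls = torch.cat([torch.randn(B, 1, C), patches], 1)  # CLS token prepended

for name, t in [("patches only", patches), ("with CLS", with_cls)]:
    L = t.shape[1]
    # the block checks L == H * W before reshaping to (B, H, W, C) for window
    # partitioning, so the extra token would trip this check
    print(f"{name}: L == H * W -> {L == H * W}")
```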

harrygcoppock avatar Dec 21 '22 10:12 harrygcoppock

What the authors are describing is applying a global average pooling layer to the feature map output of the final stage and then using a linear classifier for image classification, a strategy that achieves accuracy comparable to ViT with a [cls] token.
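In other words, the classification head looks roughly like this (a sketch with assumed dimensions, not the repo's exact code):

```python
import torch
import torch.nn as nn

# Assumed final-stage dimensions for a Swin-T-like model (illustrative only).
B, L, C, num_classes = 2, 49, 768, 1000
features = torch.randn(B, L, C)            # (batch, tokens, channels) from the last stage

norm = nn.LayerNorm(C)
avgpool = nn.AdaptiveAvgPool1d(1)
head = nn.Linear(C, num_classes)

x = norm(features)
x = avgpool(x.transpose(1, 2)).flatten(1)  # global average pool over all tokens -> (B, C)
logits = head(x)                           # linear classifier
print(logits.shape)                        # torch.Size([2, 1000])
```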

abueidvchow avatar Nov 04 '23 11:11 abueidvchow

Thanks for the response; however, what you describe is the standard approach for Swin. The authors explicitly say in the paper that they also tried prepending a CLS token.

harrygcoppock avatar Nov 04 '23 17:11 harrygcoppock