
ape (absolute positional embedding) is set to False by default, is it OK?

feiyangsuo opened this issue 3 years ago • 3 comments

The parameter ape, which controls whether an absolute position embedding is added to the patch embedding, is set to False by default in the config file.

To my knowledge, transformer models need positional embeddings on their input so the model knows where each token is. So I wonder whether setting ape=False is proper. Does this mean the Swin Transformer model built by default is not sensitive to the position of each patch? And, if so, would this affect the performance of the fine-tuned model?
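
For reference, the default I am referring to looks roughly like this in config.py (I am paraphrasing from memory, so the exact field names may differ):

```python
# Rough paraphrase of the repo's yacs-based defaults (field names approximate).
from yacs.config import CfgNode as CN

_C = CN()
_C.MODEL = CN()
_C.MODEL.SWIN = CN()
_C.MODEL.SWIN.APE = False        # no absolute position embedding by default
_C.MODEL.SWIN.PATCH_NORM = True  # normalization after patch embedding
```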

feiyangsuo commented Dec 19 '21 11:12

Hi @feiyangsuo, I was also looking into the positional embedding usage. They also use a relative positional bias: https://github.com/microsoft/Swin-Transformer/blob/eed077f68e0386e8cdff2e1981492699d9c190c0/models/swin_transformer.py#L89

It is a learnable matrix the size of a window that gets added to the attention matrix. I think this is how they can omit the absolute positional embedding. In general, there are also relative positional embedding schemes used in other vision transformer architectures in addition to the relative positional bias.
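
Roughly, there is one learnable bias per attention head for every possible relative offset inside a window, and it is looked up and added to the attention logits before the softmax. A minimal sketch of the idea (illustrative only, not the repo's exact code; the class and variable names are mine):

```python
import torch
import torch.nn as nn

class WindowAttentionWithRelBias(nn.Module):
    """Window attention with a learnable relative position bias (illustrative sketch)."""
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # one learnable bias per head for every relative offset inside the window
        self.rel_bias_table = nn.Parameter(
            torch.zeros((2 * window_size - 1) ** 2, num_heads)
        )
        # precompute, for each pair of positions in the window, the index into the table
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"
        )).flatten(1)                                   # (2, Wh*Ww)
        rel = coords[:, :, None] - coords[:, None, :]   # (2, N, N)
        rel = rel.permute(1, 2, 0) + (window_size - 1)  # shift offsets to be >= 0
        index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]
        self.register_buffer("rel_index", index)        # (N, N)

    def forward(self, x):  # x: (num_windows * B, N, dim), N = window_size ** 2
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B_, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B_, heads, N, N)
        # look up the bias for every position pair and add it to the attention logits
        bias = self.rel_bias_table[self.rel_index].permute(2, 0, 1)  # (heads, N, N)
        attn = (attn + bias.unsqueeze(0)).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)
```

Since the bias depends only on the relative offset between two positions, every window shares the same table, which is part of why it can stand in for an absolute embedding.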

marc345 commented Dec 20 '21 08:12

@feiyangsuo Yes, ape is set to False by default; we found that ape brings no benefit to general visual recognition problems. If you want to use this feature, you can initialize the ape with zero vectors, so that you can still directly leverage the pre-trained models.
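
Something along these lines (an illustrative sketch, not the exact repo code; the shapes are just example values for a 224x224 input with Swin-T):

```python
import torch
import torch.nn as nn

# Zero-initialized absolute position embedding: at the start of fine-tuning it adds
# nothing, so a checkpoint trained with ape=False is reproduced exactly at step 0.
num_patches, embed_dim = (224 // 4) ** 2, 96   # example values for a Swin-T-like model
absolute_pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

def add_ape(patch_embeddings: torch.Tensor) -> torch.Tensor:
    # patch_embeddings: (B, num_patches, embed_dim)
    return patch_embeddings + absolute_pos_embed
```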

ancientmooner commented Dec 20 '21 09:12


@marc345 Got it, thanks. So Swin Transformer is actually sensitive to relative position rather than absolute position, which is somewhat like convolution.

feiyangsuo commented Dec 22 '21 13:12