Potentially missing positional encodings in SpatialTransformer
Hey folks, first of all, thanks for all the effort put into building this amazing open source community. Here are my two cents. I may be mistaken, but in the spatial transformers we seem to be using attention without positional encodings. Is that correct? Without them, the attention has no mechanism to know the original spatial arrangement of the pixels; could that be hurting performance?
SpatialSelfAttention: https://github.com/CompVis/stable-diffusion/blob/ce05de28194041e030ccfc70c635fe3707cdfc30/ldm/modules/attention.py#L99-L149
SpatialTransformer: https://github.com/CompVis/stable-diffusion/blob/ce05de28194041e030ccfc70c635fe3707cdfc30/ldm/modules/attention.py#L218-L261
BasicTransformerBlock: https://github.com/CompVis/stable-diffusion/blob/ce05de28194041e030ccfc70c635fe3707cdfc30/ldm/modules/attention.py#L196-L215
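For illustration only, here is a minimal sketch (not the repo's code) of what adding a learned positional embedding to the flattened tokens could look like. The linked `SpatialTransformer` rearranges `b c h w -> b (h w) c` and attends directly; the module name `TokensWithLearnedPosEmb` and the `max_len` parameter below are my own placeholders.

```python
# Hypothetical sketch: add a learnable per-position vector to the flattened
# spatial tokens before they enter the transformer blocks.
import torch
import torch.nn as nn


class TokensWithLearnedPosEmb(nn.Module):
    def __init__(self, inner_dim: int, max_len: int = 64 * 64):
        super().__init__()
        # One learnable embedding per spatial position, up to max_len tokens.
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, inner_dim))
        nn.init.normal_(self.pos_emb, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height*width, inner_dim) -- the flattened feature map
        # that would otherwise go straight into the attention layers.
        n = x.shape[1]
        return x + self.pos_emb[:, :n, :]


if __name__ == "__main__":
    tokens = torch.randn(2, 32 * 32, 320)  # e.g. a 32x32 latent with 320 channels
    add_pos = TokensWithLearnedPosEmb(inner_dim=320)
    print(add_pos(tokens).shape)  # torch.Size([2, 1024, 320])
```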
Totally agree. It seems that without explicit positional encoding, the self- and cross-attention have to infer a token's location from other cues, such as the output of the convolutions (distance to the border, etc.).
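As a quick sanity check of that intuition, here is a small standalone snippet (my own sketch, not repo code) showing that a zero-padded convolution already produces position-dependent activations on a spatially constant input, which downstream attention could in principle exploit:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)

# Spatially constant input: absent padding effects, every output pixel
# would be identical.
x = torch.ones(1, 1, 8, 8)
y = conv(x)

center = y[0, :, 4, 4]  # interior pixel: full 3x3 neighbourhood of ones
corner = y[0, :, 0, 0]  # corner pixel: neighbourhood partly zero-padded
print("interior:", center)
print("corner:  ", corner)
print("identical?", torch.allclose(center, corner))  # False: borders leak position
```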