Potentially missing positional encodings in SpatialTransformer
Hey folks, first of all, thanks for all the effort put into building this amazing open source community. Here are my two cents. I may be mistaken, but in the spatial transformers we seem to be using attention without positional encodings. Is that correct? Without them, the attention has no mechanism to know the original spatial arrangement of the pixels; could that be hurting performance?
SpatialSelfAttention: https://github.com/CompVis/stable-diffusion/blob/ce05de28194041e030ccfc70c635fe3707cdfc30/ldm/modules/attention.py#L99-L149
SpatialTransformer: https://github.com/CompVis/stable-diffusion/blob/ce05de28194041e030ccfc70c635fe3707cdfc30/ldm/modules/attention.py#L218-L261
BasicTransformerBlock: https://github.com/CompVis/stable-diffusion/blob/ce05de28194041e030ccfc70c635fe3707cdfc30/ldm/modules/attention.py#L196-L215
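For illustration only, here is a minimal sketch (not the repo's code) of what adding a learned positional embedding to the flattened tokens could look like. The linked `SpatialTransformer` rearranges `b c h w -> b (h w) c` and attends directly; the module name `TokensWithLearnedPosEmb` and the `max_len` parameter below are my own placeholders.

```python
# Hypothetical sketch: add a learnable per-position vector to the flattened
# spatial tokens before they enter the transformer blocks.
import torch
import torch.nn as nn


class TokensWithLearnedPosEmb(nn.Module):
    def __init__(self, inner_dim: int, max_len: int = 64 * 64):
        super().__init__()
        # One learnable embedding per spatial position, up to max_len tokens.
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, inner_dim))
        nn.init.normal_(self.pos_emb, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height*width, inner_dim) -- the flattened feature map
        # that would otherwise go straight into the attention layers.
        n = x.shape[1]
        return x + self.pos_emb[:, :n, :]


if __name__ == "__main__":
    tokens = torch.randn(2, 32 * 32, 320)  # e.g. a 32x32 latent with 320 channels
    add_pos = TokensWithLearnedPosEmb(inner_dim=320)
    print(add_pos(tokens).shape)  # torch.Size([2, 1024, 320])
```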
Totally agree. It seems that without explicit positional encoding, the self- and cross-attention have to infer a token's location from other cues, such as the output of the convolutions (distance to the border, etc.).
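As a quick sanity check of that intuition, here is a small standalone snippet (my own sketch, not repo code) showing that a zero-padded convolution already produces position-dependent activations on a spatially constant input, which downstream attention could in principle exploit:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)

# Spatially constant input: absent padding effects, every output pixel
# would be identical.
x = torch.ones(1, 1, 8, 8)
y = conv(x)

center = y[0, :, 4, 4]  # interior pixel: full 3x3 neighbourhood of ones
corner = y[0, :, 0, 0]  # corner pixel: neighbourhood partly zero-padded
print("interior:", center)
print("corner:  ", corner)
print("identical?", torch.allclose(center, corner))  # False: borders leak position
```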