Swin-Transformer
Visualize Attentions
Hello, I was thinking about how to produce attention rollouts for Swin Transformer like in ViT. In ViT, the window size is constant, so after averaging the attentions across heads, the attention matrices can be multiplied layer by layer to form the attention rollout, as I understand it.
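For reference, a minimal sketch of that ViT-style rollout (head-averaged attention, identity added for the residual connection, then matrix products across layers, as in Abnar & Zuidema's attention rollout). The function name and the assumption that you have already collected per-layer attention tensors of shape `(batch, num_heads, tokens, tokens)` (e.g. via forward hooks) are mine:

```python
import torch

def attention_rollout(attentions, add_residual=True):
    """ViT-style attention rollout.

    attentions: list of per-layer tensors, each of shape
    (batch, num_heads, tokens, tokens), collected beforehand.
    """
    rollout = None
    for attn in attentions:
        # Average over heads: (batch, tokens, tokens)
        attn = attn.mean(dim=1)
        if add_residual:
            # Account for the skip connection, then renormalize rows.
            eye = torch.eye(attn.size(-1), device=attn.device)
            attn = (attn + eye) / 2
            attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn if rollout is None else torch.bmm(attn, rollout)
    return rollout  # (batch, tokens, tokens)
```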
However, in Swin Transformer the attention is window-local, the effective window coverage changes across stages, and there is also a cyclic shift. What kind of roadmap can be followed here to generate attention rollouts? For instance, following the Swin-T (2, 2, 6, 2) architecture, would averaging the W-MSA and SW-MSA attentions of the same layer along the first dimension, and then multiplying the resulting matrices of the following layers, make sense?
Any clues?
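Not a full answer, but one way to get the raw maps to experiment with: in the official implementation each `WindowAttention` module applies an `nn.Softmax` submodule (`self.softmax`), so a forward hook on it captures the per-window attention of shape `(num_windows * B, num_heads, N, N)` with `N = window_size**2`. A minimal sketch; the helper name `collect_window_attentions` is mine:

```python
import torch

def collect_window_attentions(model, x):
    """Capture per-window attention maps from every WindowAttention block.

    Assumes the official Swin code, where module names look like
    'layers.0.blocks.0.attn.softmax'.
    """
    attentions, handles = [], []
    for name, module in model.named_modules():
        if name.endswith('attn.softmax'):
            handles.append(module.register_forward_hook(
                lambda m, inp, out, name=name:
                    attentions.append((name, out.detach()))
            ))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return attentions  # list of (block name, attention tensor)
```

Note this only collects the window-local maps; to compose them across W-MSA/SW-MSA blocks you would still need to undo the window partition and the cyclic shift (e.g. with `torch.roll`) so that window-local indices map back to global token positions, which is exactly the open part of the question.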
You could probably plot the attention map within the neighboring window for each query point.
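Building on that suggestion, a minimal plotting sketch, assuming attention tensors captured as above with shape `(num_windows * B, num_heads, N, N)` and `N = window_size**2`; the helper name and arguments are illustrative:

```python
import matplotlib.pyplot as plt

def plot_query_attention(attn, window_size, head=0, window=0, query=0):
    """Show one query's attention over its local window as a heatmap.

    attn: tensor of shape (num_windows * B, num_heads, N, N),
    N = window_size**2, as captured by a softmax forward hook.
    """
    amap = attn[window, head, query].reshape(window_size, window_size)
    plt.imshow(amap.cpu().numpy(), cmap='viridis')
    plt.title(f'window {window}, head {head}, query {query}')
    plt.colorbar()
    plt.show()
```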
Has anyone succeeded in visualizing the attention and can share how?