ViT-pytorch
Why do we need to calculate residual connections when visualizing attention maps?
Thanks for your great work!
I am curious why we need to calculate residual connections when visualizing attention maps.
I'm curious too! Why do we need this?
Same question here. Hi @jeonsworld, could you please elaborate on the specific reason for adding this identity matrix? Much appreciated.
In my opinion: ViT's transformer blocks contain residual connections, so from layer 1 to 12 the attention map the model effectively uses is the residual attention map, i.e. the raw attention plus the identity contributed by the skip connection.
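To make that concrete: adding the identity matrix is the standard trick from the attention-rollout idea (Abnar & Zuidema, "Quantifying Attention Flow in Transformers"). Below is a minimal sketch of that idea; the function name, tensor shapes, and exact normalization are my own illustration, not necessarily the repo's code.

```python
import torch

def rollout_attention(att_mats):
    """att_mats: list of per-layer attention tensors, each of shape [heads, tokens, tokens]."""
    result = None
    for att in att_mats:
        att = att.mean(dim=0)  # average over heads -> [tokens, tokens]
        # Residual connection: each block computes x + Attention(x), so the
        # effective token-mixing matrix is A + I, re-normalized so rows sum to 1.
        aug = att + torch.eye(att.size(-1))
        aug = aug / aug.sum(dim=-1, keepdim=True)
        # Multiply layer by layer to track how information flows through layers 1..12.
        result = aug if result is None else aug @ result
    return result  # [tokens, tokens]; the [CLS] row gives the map to overlay on the image
```

Without the `+ I` term, the rollout would pretend each layer's output depends only on the attention weights, ignoring the fact that the skip connection passes each token straight through; multiplying such matrices across 12 layers would badly distort the visualized map.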