
Intuition behind Outlook Attention generation does not seem to make sense

Open toodle opened this issue 3 years ago • 13 comments

Hi,

Thanks for your work.

As you claimed, the generated $W_A$ works as the weights for aggregating the local context.

However, $W_A$ is generated by a linear operation along the channel dimension, which means its receptive field is 1. The neighboring context cannot be perceived while $W_A$ is generated.

Thus, how can $W_A$ encode relationship information?

toodle avatar Jun 26 '21 03:06 toodle

Hi, your doubt is reasonable, but there are some explanations. First, the input is forwarded through a PatchEmbed, so the receptive field after the PatchEmbed is the patch size, and therefore $W_A$ already encodes neighborhood context. You can refer to Involution (https://arxiv.org/abs/2103.06255) for details. By the way, you can also refer to #1 to join the discussion about the difference between Involution and the Outlook attention.

wuyongfa-genius avatar Jun 26 '21 04:06 wuyongfa-genius
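A toy example may help illustrate the PatchEmbed point above: a patch embedding is just a linear projection of non-overlapping P×P patches, so each token already summarizes a full P×P pixel neighborhood before any attention runs. The sketch below is hypothetical — the names `P`, `W_e`, and `patch_embed`, the sizes, and the missing normalization are illustrative assumptions, not the repo's code:

```python
import numpy as np

np.random.seed(0)
P, C = 4, 8                            # illustrative patch size and embed dim
img = np.random.randn(16, 16)          # a single-channel toy image
W_e = np.random.randn(P * P, C) * 0.1  # linear projection of a flattened patch

def patch_embed(img, P, W_e):
    # Split into non-overlapping P x P patches, flatten each, project linearly.
    H, W = img.shape
    patches = (img.reshape(H // P, P, W // P, P)
                  .transpose(0, 2, 1, 3)
                  .reshape(H // P, W // P, P * P))
    return patches @ W_e               # (H/P, W/P, C): one token per patch

tok = patch_embed(img, P, W_e)
# Token (0, 0) reacts to any pixel inside its own 4x4 patch ...
img_in = img.copy(); img_in[3, 3] += 1.0
inside = not np.allclose(tok[0, 0], patch_embed(img_in, P, W_e)[0, 0])
# ... but not to pixels outside it: its receptive field is exactly the patch.
img_out = img.copy(); img_out[4, 4] += 1.0
outside = np.allclose(tok[0, 0], patch_embed(img_out, P, W_e)[0, 0])
```

So by the time $W_A$ is produced from a token, that token already mixes a patch-size neighborhood of raw pixels.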

Hi,

The local context aggregation is conducted in the value projection operation. The attention weights are generated by a linear layer, but an unfold operation is applied to the value tensor. See Eq. 4 in the paper.

houqb avatar Jun 26 '21 04:06 houqb
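The mechanism described here (attention weights from a linear layer, unfold on the values, then fold, as in Eq. 4) can be sketched in a few lines. This is a minimal single-head NumPy sketch under simplifying assumptions — `W_a`, `W_v`, the zero padding, the loop-based unfold/fold, and the sizes are all illustrative, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unfold(x, k):
    # sliding k x k windows with zero padding: (H, W, C) -> (H, W, k*k, C)
    H, W, C = x.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty((H, W, k * k, C))
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + k, j:j + k].reshape(k * k, C)
    return out

def fold(windows, k, H, W):
    # adjoint of unfold: sum overlapping windows back onto the (H, W, C) grid
    C = windows.shape[-1]
    p = k // 2
    out = np.zeros((H + 2 * p, W + 2 * p, C))
    for i in range(H):
        for j in range(W):
            out[i:i + k, j:j + k] += windows[i, j].reshape(k, k, C)
    return out[p:p + H, p:p + W]

def outlook_attention(x, W_a, W_v, k=3):
    # x: (H, W, C); W_a: (C, k**4) attention-generating linear; W_v: (C, C)
    H, W, C = x.shape
    A = softmax((x @ W_a).reshape(H, W, k * k, k * k), axis=-1)  # per-pixel k^2 x k^2 weights
    V = unfold(x @ W_v, k)                                       # (H, W, k*k, C) local values
    out = np.einsum('hwab,hwbc->hwac', A, V)                     # weighted aggregation (Eq. 4)
    return fold(out, k, H, W)

np.random.seed(0)
H, W, C, k = 6, 6, 8, 3
x = np.random.randn(H, W, C)
y = outlook_attention(x, np.random.randn(C, k ** 4) * 0.1,
                      np.random.randn(C, C) * 0.1, k)
```

Note that in this sketch $W_A$ is computed per pixel with a 1×1 receptive field; only the values are unfolded — which is exactly the point being debated in this thread.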

Hi, thank you for your paper and congrats on the SOTA results.

I have a question related to this: from the linear projection we generate an attention map for each pixel within the local neighborhood. In the fold operation we sum these weighted averages, so wouldn't it be sufficient to use a single K^2 weighting across the entire neighborhood followed by a softmax? There should only be minor differences, because the current method takes the softmax multiple times while the alternative only does it once, right?

thank you

monney avatar Jun 26 '21 05:06 monney

@monney I believe that K is the kernel size, so there are K^2 tokens in a KxK window, and therefore (K^2)^2 attention weights are needed to model all pairwise relationships, as in standard self-attention. >_<

wuyongfa-genius avatar Jun 26 '21 05:06 wuyongfa-genius

Thank you for your reply. @Andrew-Qibin

> The attention weights are generated by a linear layer, but an unfold operation is applied to the value tensor.

My concern is the attention weights. They represent the importance of the local context during aggregation, so they should naturally have seen the local context. As you said, they are generated by a linear operation, so they never get access to the local context. How can they know the importance?

toodle avatar Jun 26 '21 06:06 toodle

> Hi, thank you for your paper and congrats on the SOTA results.
>
> I have a question related to this: from the linear projection we generate an attention map for each pixel within the local neighborhood. In the fold operation we sum these weighted averages, so wouldn't it be sufficient to use a single K^2 weighting across the entire neighborhood followed by a softmax? There should only be minor differences, because the current method takes the softmax multiple times while the alternative only does it once, right?
>
> thank you

@monney has given the correct answer. By the way, another advantage of our Outlooker is that you can introduce a stride in the unfold, which allows it to run faster with no sacrifice in accuracy.

houqb avatar Jun 26 '21 06:06 houqb

@Andrew-Qibin I think the main question here is that, since the attention matrix is generated without similarity scores, the attention scores are based only on the central pixel.

My question is related: since this is summed to a single position (i, j), it should be sufficient to have a single attention matrix across the entire patch instead of one per pixel. I'm just curious whether this was tried, or whether there's a reason for generating an attention map for each pixel and then aggregating later.

EDIT: My question was answered in the other thread. We aggregate at all pixels, not just the center, since the fold sums each pixel's value across windows; this means each pixel must be modified to carry the correct weighted average, not simply the center pixel, which would be the case if we only summed over each individual neighborhood.

monney avatar Jun 26 '21 06:06 monney
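The EDIT above can be checked numerically: fold is the adjoint of unfold, i.e., it sums each pixel's value across every window covering it rather than averaging over a single neighborhood. In the small NumPy sketch below (the loop-based unfold/fold and the zero padding are illustrative assumptions), fold(unfold(x)) multiplies each pixel by its coverage count — K^2 in the interior, fewer at the borders:

```python
import numpy as np

def unfold(x, k):
    # sliding k x k windows with zero padding: (H, W, C) -> (H, W, k*k, C)
    H, W, C = x.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty((H, W, k * k, C))
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + k, j:j + k].reshape(k * k, C)
    return out

def fold(windows, k, H, W):
    # adjoint of unfold: SUM overlapping windows back onto the (H, W, C) grid
    C = windows.shape[-1]
    p = k // 2
    out = np.zeros((H + 2 * p, W + 2 * p, C))
    for i in range(H):
        for j in range(W):
            out[i:i + k, j:j + k] += windows[i, j].reshape(k, k, C)
    return out[p:p + H, p:p + W]

H = W = 5
k = 3
x = np.arange(H * W, dtype=float).reshape(H, W, 1)
cover = fold(unfold(np.ones_like(x), k), k, H, W)  # how many windows cover each pixel
y = fold(unfold(x, k), k, H, W)                    # equals x * cover, pixel by pixel
```

This is why each pixel in a window has to carry its own correct weighted average before folding: the fold blindly sums the overlapping contributions.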

Agree with @monney

Only the central pixel is considered when the attention weights are generated. How can they represent the importance of the neighboring context?

toodle avatar Jun 26 '21 07:06 toodle

@toodle I think because of #7 the weights end up being based on the KxK pixels, at least indirectly, aggregating from the (2K-1)x(2K-1) surrounding pixels. It differs from traditional attention, though, and is definitely more similar to a dynamic convolution, I think.

monney avatar Jun 26 '21 20:06 monney

@monney Thanks. Agree with your opinion on dynamic convolution as discussed in https://github.com/sail-sg/volo/issues/5

toodle avatar Jun 27 '21 01:06 toodle

@monney I understand that surrounding pixels can be aggregated by the unfold operation.

However, my focus is on the attention weights, because the attention weights never receive surrounding information.

I admit it works well judging from the experimental results, but the story does not hold together.

toodle avatar Jun 27 '21 01:06 toodle

@toodle I agree with you. The attention weights don't take the similarity between pixels into account, and this, to me, is a key piece of attention. But because of the way they are generated, the attention maps actually do end up taking every pixel's value into account. Specifically, since each central pixel generates a map for its surrounding pixels, and you fold across them, the result is roughly equivalent to generating the attention maps with a fully connected layer across all KxK pixels: you are summing across KxK weighted averages, each generated by a different central pixel.

monney avatar Jun 28 '21 02:06 monney
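The "indirect (2K-1)x(2K-1) receptive field" claim above can be probed with a perturbation test on a toy forward pass. Everything below is a hypothetical single-head NumPy sketch, not the paper's code: with K = 3, perturbing a pixel K-1 = 2 steps away from a position changes that position's output (its value enters via a neighboring center's window), while a pixel K = 3 steps away leaves it untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unfold(x, k):
    # sliding k x k windows with zero padding: (H, W, C) -> (H, W, k*k, C)
    H, W, C = x.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty((H, W, k * k, C))
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + k, j:j + k].reshape(k * k, C)
    return out

def fold(windows, k, H, W):
    # adjoint of unfold: sum overlapping windows back onto the grid
    C = windows.shape[-1]
    p = k // 2
    out = np.zeros((H + 2 * p, W + 2 * p, C))
    for i in range(H):
        for j in range(W):
            out[i:i + k, j:j + k] += windows[i, j].reshape(k, k, C)
    return out[p:p + H, p:p + W]

def outlook_attention(x, W_a, W_v, k=3):
    # W_a: (C, k**4) linear that generates per-pixel k^2 x k^2 attention weights
    H, W, C = x.shape
    A = softmax((x @ W_a).reshape(H, W, k * k, k * k), axis=-1)
    V = unfold(x @ W_v, k)
    return fold(np.einsum('hwab,hwbc->hwac', A, V), k, H, W)

np.random.seed(1)
H = W = 9
C, k = 4, 3
x = np.random.randn(H, W, C)
W_a = np.random.randn(C, k ** 4) * 0.1
W_v = np.random.randn(C, C) * 0.1
y0 = outlook_attention(x, W_a, W_v, k)

x_near = x.copy(); x_near[4, 6] += 1.0  # 2 = K-1 steps from position (4, 4)
x_far = x.copy(); x_far[4, 7] += 1.0    # 3 = K steps from position (4, 4)
changed_near = not np.allclose(y0[4, 4], outlook_attention(x_near, W_a, W_v, k)[4, 4])
changed_far = not np.allclose(y0[4, 4], outlook_attention(x_far, W_a, W_v, k)[4, 4])
```

So each output position depends on a (2K-1)x(2K-1) neighborhood, even though each individual $W_A$ is generated from a single token.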

Would it make sense to apply the unfold operation to the attention tensor as well, to capture some local context?

kpmokpmo avatar Jul 06 '21 05:07 kpmokpmo