
about MaxPool

foralliance opened this issue 4 years ago · 5 comments

Hi @Nandan91 @rajatsaini0294. For each subspace, the input is H×W×G. After DW + MaxPool + PW, the intermediate attention map is H×W×1; then, after Softmax + Expand, the final attention map is H×W×G.
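In code, my understanding of one subspace is roughly the following PyTorch sketch (the 3×3 kernel sizes, the stride-1 padded MaxPool, and the residual connection are my assumptions, not necessarily your exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceAttentionSketch(nn.Module):
    """One ULSAM subspace as I understand it: DW -> MaxPool -> PW(1) -> Softmax -> Expand."""
    def __init__(self, g_channels):
        super().__init__()
        # depthwise conv over the G channels of this subspace
        self.dw = nn.Conv2d(g_channels, g_channels, kernel_size=3, padding=1,
                            groups=g_channels, bias=False)
        # spatial max pooling that keeps H x W (stride 1, padded)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        # pointwise conv collapsing the G channels into a single H x W x 1 map
        self.pw = nn.Conv2d(g_channels, 1, kernel_size=1, bias=False)

    def forward(self, x):                                  # x: (N, G, H, W)
        a = self.pw(self.pool(self.dw(x)))                 # (N, 1, H, W)
        a = F.softmax(a.flatten(2), dim=-1).view_as(a)     # softmax over spatial positions
        return x * a.expand_as(x) + x                      # one map expanded to all G channels
```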

Because the output dimension of this PW operation is 1, the final attention map amounts to a single weight map shared by all channels in the group. Why use this PW? Why is it designed so that all channels share one weight?

If this PW operation were removed, i.e., if the output of the MaxPool operation were treated as the final attention map, then each point and each channel would have its own independent weight. Why not design it this way?
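The variant I am asking about would, reusing the layers from the sketch above, look roughly like this (purely hypothetical, just to make the question concrete):

```python
# hypothetical variant: drop the PW and use the MaxPool output directly as the attention map
def forward_no_pw(self, x):                            # x: (N, G, H, W)
    a = self.pool(self.dw(x))                          # (N, G, H, W): one map per channel
    a = F.softmax(a.flatten(2), dim=-1).view_as(a)     # per-channel spatial softmax
    return x * a + x                                   # every point and channel gets its own weight
```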

many many thanks!!!

foralliance avatar Dec 20 '20 08:12 foralliance

Thanks @foralliance for the question.

Your question is equivalent to Case 3 in Section 3.2 of the paper. Please refer to it.

rajatsaini0294 avatar Jan 30 '21 10:01 rajatsaini0294

@rajatsaini0294 Thanks for your reply. You are right; this PW operation is necessary: only in this way is interaction between channels guaranteed.

Another question: if an ordinary convolution whose output dimension is also G were used to replace the PW, it would not only give each point and each channel its own independent weight, but would also keep the interaction between channels within each group. Have you tried such a design?

foralliance avatar Jan 30 '21 12:01 foralliance

Do you mean that, without partitioning the input into G groups, a convolution would generate G output channels and then G attention maps from them? If I misunderstood, can you explain your idea in more detail?

rajatsaini0294 avatar Jan 31 '21 01:01 rajatsaini0294

Sorry for not expressing myself clearly.

My idea is that the whole design stays exactly the same as in Figure 2; the only difference is that an ordinary convolution whose output dimension is also G replaces the original PW.

This replacement still ensures interaction between the channels in each group, i.e., it captures the cross-channel information you mention in Case 3 of Section 3.2. In addition, it brings an extra effect: each point and each channel gets its own independent weight, rather than all channels (within a group) sharing one weight.
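Concretely, the only change to the sketch from my first comment would be something like this (the 3×3 kernel size here is just an assumption for illustration):

```python
# hypothetical replacement for the PW: ordinary k x k conv with G output channels
self.pw = nn.Conv2d(g_channels, g_channels, kernel_size=3, padding=1, bias=False)

def forward(self, x):                                  # x: (N, G, H, W)
    a = self.pw(self.pool(self.dw(x)))                 # (N, G, H, W): per-channel maps, with cross-channel mixing
    a = F.softmax(a.flatten(2), dim=-1).view_as(a)     # per-channel spatial softmax
    return x * a + x                                   # no Expand needed: there are already G maps
```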

foralliance avatar Jan 31 '21 04:01 foralliance

I understand your point. We have not tried this design because it would increase the number of parameters. You are welcome to try it and let us know how it works. :-)
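For example, a rough per-subspace count of the extra weights (bias terms omitted; G and k are only example values):

```python
G, k = 16, 3                   # e.g. G channels per subspace, k x k replacement kernel
pw_params = G * 1 * 1 * 1      # 1x1 PW, G -> 1 channel: G parameters
conv_params = G * G * k * k    # ordinary k x k conv, G -> G channels: G^2 * k^2 parameters
print(pw_params, conv_params)  # 16 vs 2304 in this example
```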

rajatsaini0294 avatar Feb 01 '21 11:02 rajatsaini0294