ULSAM
about MaxPool
@Nandan91 @rajatsaini0294 Hi. For each subspace the input is HxWxG; after DW + MaxPool + PW, the intermediate attention map is HxWx1, and after Softmax + Expand, the final attention map is HxWxG.
Because the output dimension of this PW operation is 1, the final attention map amounts to one weight shared by all channels at each spatial location. Why use this PW? Why is it designed so that all channels share one weight?
If this PW operation were removed, i.e. the output of the MaxPool operation were treated as the final attention map, then every point and every channel would have its own independent weight. Why not design it this way?
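For concreteness, here is a minimal PyTorch-style sketch of one subspace branch as I understand it (the 1x1/3x3 kernel sizes, the spatial softmax, and the residual add are my assumptions, not taken from the repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceAttention(nn.Module):
    def __init__(self, g):  # g = channels per subspace (G)
        super().__init__()
        self.dw = nn.Conv2d(g, g, kernel_size=1, groups=g)            # DW (depthwise)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # MaxPool
        self.pw = nn.Conv2d(g, 1, kernel_size=1)                      # PW: G -> 1

    def forward(self, x):                         # x: (B, G, H, W), one subspace
        a = self.pw(self.pool(self.dw(x)))        # (B, 1, H, W) intermediate map
        b, _, h, w = a.shape
        a = F.softmax(a.view(b, 1, -1), dim=-1).view(b, 1, h, w)  # Softmax (assumed spatial)
        a = a.expand_as(x)                        # Expand -> (B, G, H, W)
        return x * a + x                          # re-weight + identity (assumed residual)

x = torch.randn(2, 16, 32, 32)                   # one subspace with G = 16
print(SubspaceAttention(16)(x).shape)            # torch.Size([2, 16, 32, 32])
```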
many many thanks!!!
Thanks foralliance for the question.
Your question corresponds to Case 3 in Section 3.2 of the paper; please refer to it.
@rajatsaini0294 Thanks for your reply. You are right: this PW operation is necessary, because only in this way is interaction between the channels guaranteed.
Another question: what if the PW were replaced by an ordinary convolution whose output dimension is also G? This would not only give each point and each channel its own independent weight, but would also keep the interaction between the channels in each group. Have you tried such a design?
Do you mean that, without partitioning the input into G groups, a convolution generates G output channels and G attention maps from them? If I have misunderstood, could you explain your idea in more detail?
Sorry for not expressing myself clearly.
My idea is that everything stays exactly the same as in Figure 2; the only difference is that the original PW is replaced by an ordinary convolution whose output dimension is also G.
This replacement still ensures interaction between the channels in each group, i.e. it captures the cross-channel information you mention in Case 3 of Section 3.2. In addition, it has the extra effect that each point and each channel gets its own independent weight, rather than all channels in a group sharing one weight.
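As a concrete illustration of the proposal, a sketch of the modified branch (again only a sketch; the 1x1 kernels and the per-channel spatial softmax are my assumptions, and only the PW line changes relative to the branch above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceAttentionFullConv(nn.Module):
    """Same branch as Figure 2, but the G->1 PW is replaced by an ordinary
    G->G convolution, so every point and every channel gets its own weight."""
    def __init__(self, g):
        super().__init__()
        self.dw = nn.Conv2d(g, g, kernel_size=1, groups=g)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.conv = nn.Conv2d(g, g, kernel_size=1)   # ordinary conv, output dim G

    def forward(self, x):                            # x: (B, G, H, W)
        a = self.conv(self.pool(self.dw(x)))         # (B, G, H, W): no Expand needed
        b, g, h, w = a.shape
        a = F.softmax(a.view(b, g, -1), dim=-1).view(b, g, h, w)  # per-channel spatial softmax
        return x * a + x
```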
I understand your point. We have not tried this design because it would increase the number of parameters. You are of course welcome to try it and let us know how it works. :-)
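For a rough sense of the overhead per subspace (1x1 kernels, no bias; G = 16 is only an illustrative value): the G->1 PW has G weights, while an ordinary G->G 1x1 convolution has G^2 weights.

```python
import torch.nn as nn

g = 16  # channels per subspace (illustrative value)
pw   = nn.Conv2d(g, 1, kernel_size=1, bias=False)   # original PW: G -> 1
full = nn.Conv2d(g, g, kernel_size=1, bias=False)   # proposed ordinary conv: G -> G
print(sum(p.numel() for p in pw.parameters()))      # 16 parameters per subspace
print(sum(p.numel() for p in full.parameters()))    # 256 parameters per subspace
```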