BEVFormer

Question about the criteria for using cross-attention

Open joonsu0109gh opened this issue 1 year ago • 0 comments

In BEVFormer, temporal BEV feature maps are combined by align -> concat -> self-attention. In contrast, the surround-view image feature maps are merged into a single BEV feature map with cross-attention.

Alternatively, one might consider using cross-attention to merge the Temporal BEV feature maps, and it seems feasible to use align -> concat -> self-attention when merging the Surround-view image feature maps.
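To make the two patterns being compared concrete, here is a minimal sketch. It assumes plain dense single-head attention in NumPy, not the deformable attention that BEVFormer actually uses, and every name and shape in it (`attention`, `bev_queries`, `prev_bev_aligned`, `img_feats`, the token counts) is hypothetical and chosen only for illustration.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: q (Nq, C), k and v (Nk, C) -> (Nq, C)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v

rng = np.random.default_rng(0)
C = 8            # channel dimension (hypothetical)
n_bev = 16       # flattened BEV cells, e.g. a 4 x 4 grid (hypothetical)
n_img = 6 * 10   # 6 cameras x 10 image tokens each (hypothetical)

bev_queries = rng.standard_normal((n_bev, C))
# Assumed to be already warped into the current ego frame (the "align" step).
prev_bev_aligned = rng.standard_normal((n_bev, C))
img_feats = rng.standard_normal((n_img, C))

# Pattern A (temporal): align -> concat -> self-attention.
# Keys/values live on the same BEV grid as the queries and include the queries.
temporal_tokens = np.concatenate([prev_bev_aligned, bev_queries], axis=0)
bev_temporal = attention(bev_queries, temporal_tokens, temporal_tokens)

# Pattern B (spatial): cross-attention.
# Keys/values come from a different space (image tokens from all cameras).
bev_spatial = attention(bev_temporal, img_feats, img_feats)

print(bev_temporal.shape, bev_spatial.shape)
```

In this simplified form both patterns call the same `attention` function; the only difference is where the keys and values come from. In pattern A they share the BEV grid with the queries, while in pattern B they come from the image plane.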

I am curious why BEVFormer uses align -> concat -> self-attention for temporal fusion and cross-attention for surround-view fusion. Was it simply because the performance was better, or because it is structurally more sensible?

joonsu0109gh, Jul 28 '23 07:07