BEVFormer
Question about the criteria for using cross-attention
In BEVFormer, temporal BEV feature maps are combined by align -> concat -> self-attention. In contrast, the surround-view image feature maps are merged into a single BEV feature map with cross-attention.
The roles could in principle be swapped: cross-attention could merge the temporal BEV feature maps, and align -> concat -> self-attention seems feasible for merging the surround-view image feature maps.
I am curious why BEVFormer uses align -> concat -> self-attention for temporal fusion and cross-attention for surround-view fusion. Was it simply because the performance was better, or because it is structurally more sensible?
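To make the two patterns I am comparing concrete, here is a minimal NumPy sketch. It is only an illustration under assumptions: it uses plain dense scaled dot-product attention instead of the deformable attention BEVFormer actually uses, it assumes the previous BEV map has already been aligned to the current ego frame, and the names `attention`, `temporal_fusion`, and `spatial_fusion` are hypothetical, not taken from the BEVFormer code.

```python
import numpy as np


def attention(queries, keys, values):
    """Dense scaled dot-product attention (a stand-in for deformable attention)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values


def temporal_fusion(bev_queries, prev_bev_aligned):
    """align (assumed already done) -> concat -> attention over BEV features."""
    # Keys/values are the current queries plus the aligned previous BEV map,
    # so everything attended to lives on the same BEV grid as the queries.
    memory = np.concatenate([bev_queries, prev_bev_aligned], axis=0)
    return attention(bev_queries, memory, memory)


def spatial_fusion(bev_queries, image_features):
    """Cross-attention: BEV queries attend to surround-view image features."""
    # image_features has shape (num_cams, num_pixels, dim); flatten the cameras.
    memory = image_features.reshape(-1, image_features.shape[-1])
    return attention(bev_queries, memory, memory)


rng = np.random.default_rng(0)
num_cells, dim = 16, 8  # toy sizes, chosen only for illustration
bev_queries = rng.normal(size=(num_cells, dim))
prev_bev_aligned = rng.normal(size=(num_cells, dim))
image_features = rng.normal(size=(6, 20, dim))  # 6 cameras, 20 pixels each

temporal_out = temporal_fusion(bev_queries, prev_bev_aligned)
spatial_out = spatial_fusion(bev_queries, image_features)
print(temporal_out.shape, spatial_out.shape)
```

In this sketch both functions return one feature vector per BEV query. The only structural difference is where the keys and values come from: the BEV grid itself (plus its aligned history) in the temporal case, and the perspective-view image features in the surround-view case.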