Learnable Align Attention Implementation
In the DeepFusion paper it is stated that:
For each query (i.e., voxel cell), we conduct inner product between the query and the keys to obtain the attention affinity matrix that contains 1 × N correlations between the voxel and all its corresponding N camera features.
So I think this should lead to V × N correlations for V voxel cells, and B × V × N if we consider batches. However, in the implementation, `affinity = tf.einsum('bnc,bnc->bn', q, k)` produces a B × N shaped tensor. I feel like this should be `affinity = tf.einsum("bij,bkl->bik", q, k)`. I couldn't manage to wrap my head around this; what am I missing?
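One reading that would make the shapes consistent (just my own sketch, assuming the leading `b` axis in the code is batch and voxels flattened together, and that `q` has already been tiled along the `n` axis): in that case `'bnc,bnc->bn'` is exactly the per-voxel 1 × N inner product from the paper, with the V dimension folded into `b`. The shape names below (`B`, `V`, `N`, `C`) are placeholders of my own, not the variables used in lingvo.

```python
import tensorflow as tf

# Hypothetical shapes: B batches, V voxels/pillars, N camera features per voxel, C channels.
B, V, N, C = 2, 3, 5, 4

# One query per voxel and N keys per voxel, with batch and voxel flattened into one axis.
q_single = tf.random.normal([B * V, C])   # [B*V, C]    one query per voxel
k = tf.random.normal([B * V, N, C])       # [B*V, N, C] N camera features per voxel

# Tile the query along the N axis so q and k share the same shape,
# mirroring the 'bnc,bnc->bn' pattern in the implementation.
q = tf.tile(q_single[:, None, :], [1, N, 1])  # [B*V, N, C]

affinity = tf.einsum('bnc,bnc->bn', q, k)     # [B*V, N]

# Same numbers as contracting the un-tiled query against the keys.
reference = tf.einsum('bc,bnc->bn', q_single, k)
print(affinity.shape)                                       # (6, 5)
print(tf.reduce_max(tf.abs(affinity - reference)).numpy())  # ~0.0

# Un-flattening recovers the B x V x N affinity the paper describes.
print(tf.reshape(affinity, [B, V, N]).shape)                # (2, 3, 5)
```

If that reading is right, the `b` axis in the code is not the dataset batch size alone but batch × voxels, which would explain why no separate V axis shows up in the einsum. As an aside, `'bij,bkl->bik'` sums each operand over its own last axis separately rather than contracting them against each other, so it would not give a dot product.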
Finally, thanks to the team for this great work. @LiYingwei
It sounds like the voxels they are talking about are in fact pillars, with one per BEV grid cell, but I'm not 100% sure. Another interesting question is the definition of "corresponding N camera features": do you know which camera points are considered for a given lidar feature?
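In case it helps frame that second question: in camera–lidar fusion work generally (I can't confirm this is exactly what lingvo does), the "corresponding" camera features for a lidar feature are usually obtained by projecting the lidar points of a voxel/pillar into the image using the calibration matrices, then gathering the image features at those pixels. A rough sketch of that projection; the function name and arguments are my own placeholders, not lingvo APIs.

```python
import tensorflow as tf

def project_points_to_image(points_xyz, lidar_to_camera, camera_intrinsics):
  """Projects lidar points to pixel coordinates (illustrative sketch, not lingvo code).

  points_xyz:        [P, 3] lidar points, e.g. the points inside one pillar.
  lidar_to_camera:   [4, 4] extrinsic transform from lidar frame to camera frame.
  camera_intrinsics: [3, 3] pinhole intrinsics.
  Returns:           [P, 2] (u, v) pixel coordinates.
  """
  ones = tf.ones_like(points_xyz[:, :1])
  points_h = tf.concat([points_xyz, ones], axis=-1)             # [P, 4] homogeneous coords
  cam = tf.matmul(points_h, lidar_to_camera, transpose_b=True)  # [P, 4] in camera frame
  uvw = tf.matmul(cam[:, :3], camera_intrinsics, transpose_b=True)  # [P, 3]
  # Divide by depth to get pixel coordinates; guard against near-zero depth.
  return uvw[:, :2] / tf.maximum(uvw[:, 2:3], 1e-6)
```

With something like that, the N camera features for a pillar would just be the image features gathered at the (u, v) locations of that pillar's points, but whether that matches the repo's data pipeline is something I'd also like to know.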