Learnable Align Attention Implementation
In the DeepFusion paper it is stated that:
For each query (i.e., voxel cell), we conduct inner product between the query and the keys to obtain the attention affinity matrix that contains 1 × N correlations between the voxel and all its corresponding N camera features.
So I think this should lead to V × N correlations for V voxel cells, and B × V × N if we consider batches. However, in the implementation, `affinity = tf.einsum('bnc,bnc->bn', q, k)` produces a B × N shaped tensor. I feel like this should be `affinity = tf.einsum("bij,bkl->bik", q, k)`. I couldn't manage to wrap my head around this; what am I missing?
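One reading that would make the shapes consistent (just my own sketch, assuming the leading `b` axis in the code is batch and voxels flattened together, and that `q` has already been tiled along the `n` axis): in that case `'bnc,bnc->bn'` is exactly the per-voxel 1 × N inner product from the paper, with the V dimension folded into `b`. The shape names below (`B`, `V`, `N`, `C`) are placeholders of my own, not the variables used in lingvo.

```python
import tensorflow as tf

# Hypothetical shapes: B batches, V voxels/pillars, N camera features per voxel, C channels.
B, V, N, C = 2, 3, 5, 4

# One query per voxel and N keys per voxel, with batch and voxel flattened into one axis.
q_single = tf.random.normal([B * V, C])   # [B*V, C]    one query per voxel
k = tf.random.normal([B * V, N, C])       # [B*V, N, C] N camera features per voxel

# Tile the query along the N axis so q and k share the same shape,
# mirroring the 'bnc,bnc->bn' pattern in the implementation.
q = tf.tile(q_single[:, None, :], [1, N, 1])  # [B*V, N, C]

affinity = tf.einsum('bnc,bnc->bn', q, k)     # [B*V, N]

# Same numbers as contracting the un-tiled query against the keys.
reference = tf.einsum('bc,bnc->bn', q_single, k)
print(affinity.shape)                                       # (6, 5)
print(tf.reduce_max(tf.abs(affinity - reference)).numpy())  # ~0.0

# Un-flattening recovers the B x V x N affinity the paper describes.
print(tf.reshape(affinity, [B, V, N]).shape)                # (2, 3, 5)
```

If that reading is right, the `b` axis in the code is not the dataset batch size alone but batch × voxels, which would explain why no separate V axis shows up in the einsum. As an aside, `'bij,bkl->bik'` sums each operand over its own last axis separately rather than contracting them against each other, so it would not give a dot product.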
Finally, thanks to the team for this great work. @LiYingwei
It sounds like the voxels they are talking about are in fact pillars, with one per BEV grid cell, but I'm not 100% sure. Another interesting question is the definition of "corresponding N camera features": do you know which camera points are considered for a given lidar feature?
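In case it helps frame that second question: in camera–lidar fusion work generally (I can't confirm this is exactly what lingvo does), the "corresponding" camera features for a lidar feature are usually obtained by projecting the lidar points of a voxel/pillar into the image using the calibration matrices, then gathering the image features at those pixels. A rough sketch of that projection; the function name and arguments are my own placeholders, not lingvo APIs.

```python
import tensorflow as tf

def project_points_to_image(points_xyz, lidar_to_camera, camera_intrinsics):
  """Projects lidar points to pixel coordinates (illustrative sketch, not lingvo code).

  points_xyz:        [P, 3] lidar points, e.g. the points inside one pillar.
  lidar_to_camera:   [4, 4] extrinsic transform from lidar frame to camera frame.
  camera_intrinsics: [3, 3] pinhole intrinsics.
  Returns:           [P, 2] (u, v) pixel coordinates.
  """
  ones = tf.ones_like(points_xyz[:, :1])
  points_h = tf.concat([points_xyz, ones], axis=-1)             # [P, 4] homogeneous coords
  cam = tf.matmul(points_h, lidar_to_camera, transpose_b=True)  # [P, 4] in camera frame
  uvw = tf.matmul(cam[:, :3], camera_intrinsics, transpose_b=True)  # [P, 3]
  # Divide by depth to get pixel coordinates; guard against near-zero depth.
  return uvw[:, :2] / tf.maximum(uvw[:, 2:3], 1e-6)
```

With something like that, the N camera features for a pillar would just be the image features gathered at the (u, v) locations of that pillar's points, but whether that matches the repo's data pipeline is something I'd also like to know.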