SparseBEV
Scale-adaptive self-attention
Hello authors,
Efficient implementation equivalent to the following (from the PyTorch docs for `scaled_dot_product_attention`):

```python
attn_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=0) if is_causal else attn_mask
if attn_mask.dtype == torch.bool:
    # convert a boolean mask into an additive float mask: False -> -inf, True -> 0
    attn_mask = torch.zeros_like(attn_mask, dtype=Q.dtype).masked_fill(~attn_mask, float('-inf'))
attn_weight = torch.softmax((Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))) + attn_mask, dim=-1)
attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
return attn_weight @ V
```
According to the torch source, the `attn_mask` is added directly to the scaled $QK^\top$ logits when computing `attn_weight`.
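A quick self-contained check of that additive behavior (toy shapes and variable names are my own, not from either codebase):

```python
import math
import torch
import torch.nn.functional as F

B, H, L, D = 2, 8, 4, 32
Q = torch.randn(B, H, L, D)
K = torch.randn(B, H, L, D)
V = torch.randn(B, H, L, D)
attn_mask = torch.randn(B, H, L, L)  # float mask

# Manual path, mirroring the docs pseudocode (dropout_p = 0):
manual = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(D) + attn_mask, dim=-1) @ V

# Fused path:
fused = F.scaled_dot_product_attention(Q, K, V, attn_mask=attn_mask)

print(torch.allclose(manual, fused, atol=1e-5))  # True: a float mask is simply added
```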
In SparseBEV, the attention mask is constructed as follows:

```python
tau = tau.permute(0, 2, 1)
attn_mask = dist[:, None, :, :] * tau[..., None]  # [B, 8, Q, Q]
if pre_attn_mask is not None:  # for query denoising
    attn_mask[:, :, pre_attn_mask] = float('-inf')
attn_mask = attn_mask.flatten(0, 1)  # [Bx8, Q, Q]
```
According to the SparseBEV source, this `attn_mask` corresponds to $\tau D$.
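To make the shapes concrete, here is a self-contained toy version of that construction (the values and the `centers`/`torch.cdist` stand-ins are my own assumptions, not the repo's code, and I leave the sign convention of `dist` aside):

```python
import torch
import torch.nn as nn

B, Q, H, C = 2, 6, 8, 256                           # batch, queries, heads, channels (toy values)

centers = torch.randn(B, Q, 2)                      # stand-in for decoded query centers in BEV
dist = torch.cdist(centers, centers)                # [B, Q, Q] pairwise center distances
tau = torch.rand(B, Q, H)                           # stand-in for the per-query, per-head tau

tau = tau.permute(0, 2, 1)                          # [B, H, Q]
attn_mask = dist[:, None, :, :] * tau[..., None]    # [B, H, Q, Q]
attn_mask = attn_mask.flatten(0, 1)                 # [B*H, Q, Q]

# nn.MultiheadAttention accepts a float mask of shape [B*num_heads, L, S]
# and adds it to the attention logits before the softmax.
mha = nn.MultiheadAttention(C, num_heads=H, batch_first=True)
x = torch.randn(B, Q, C)
out, _ = mha(x, x, x, attn_mask=attn_mask)
print(out.shape)                                    # torch.Size([2, 6, 256])
```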
But this does not match the formula in the paper: the paper's formula uses subtraction, while the code above amounts to addition. Am I misunderstanding something?
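To spell out the mismatch as I read it, the paper's scale-adaptive self-attention is

$$
\mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} - \tau D\right) V,
$$

while the code path above effectively computes

$$
\mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + \underbrace{\tau D}_{\texttt{attn\_mask}}\right) V.
$$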