Open-Sora
Why does cross-attention view the batch dimension as 1?
Here's the MultiHeadCrossAttention code:
```python
# x: (B, N, C) visual tokens; cond: concatenated text tokens; mask: per-sample text lengths
B, N, C = x.shape

# Fold the batch into the sequence dimension: q becomes (1, B*N, num_heads, head_dim)
q = self.q_linear(x).view(1, -1, self.num_heads, self.head_dim)
kv = self.kv_linear(cond).view(1, -1, 2, self.num_heads, self.head_dim)
k, v = kv.unbind(2)

attn_bias = None
if mask is not None:
    # Block-diagonal mask: query block i (length N) may only attend to key block i (length mask[i])
    attn_bias = xformers.ops.fmha.BlockDiagonalMask.from_seqlens([N] * B, mask)
x = xformers.ops.memory_efficient_attention(q, k, v, p=self.attn_drop.p, attn_bias=attn_bias)

x = x.view(B, -1, C)
```
If the batch dimensions of q, k, and v are all folded into the second (sequence) dimension, with the first dimension fixed at 1, won't this mix up information from different videos and images within a batch? I see that the mask passed to attn_bias carries the batch-boundary information. Is it the attn_bias parameter that keeps the computation for different samples separate during memory_efficient_attention? If so, how does it guarantee that their information doesn't get mixed?
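My current reading is that BlockDiagonalMask simply forbids any query token from attending to a key token that belongs to a different sample, so folding the batch into the sequence dimension would be equivalent to running cross-attention per sample. Below is a minimal pure-PyTorch sketch of that hypothesis (it does not use the xformers kernel; all shapes, names, and values are made up for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, D = 3, 4, 8                        # batch size, visual tokens per sample, head dim
kv_lens = [5, 2, 7]                      # text tokens per sample (variable lengths)

q = torch.randn(B, N, D)                         # per-sample queries
kvs = [torch.randn(L, D) for L in kv_lens]       # per-sample keys/values (shared here)

# Reference: run cross-attention one sample at a time.
ref = torch.stack([
    F.scaled_dot_product_attention(q[i][None], kvs[i][None], kvs[i][None])[0]
    for i in range(B)
])

# Packed version: batch folded into the sequence dimension (leading dim = 1),
# plus a block-diagonal bias so query block i can only see key block i.
q_flat = q.reshape(1, B * N, D)
kv_flat = torch.cat(kvs, dim=0)[None]            # (1, sum(kv_lens), D)

bias = torch.full((B * N, kv_flat.shape[1]), float("-inf"))
start = 0
for i, L in enumerate(kv_lens):
    bias[i * N:(i + 1) * N, start:start + L] = 0.0   # allow only the matching block
    start += L

out = F.scaled_dot_product_attention(q_flat, kv_flat, kv_flat, attn_mask=bias)
out = out.reshape(B, N, D)

print(torch.allclose(out, ref, atol=1e-5))       # True: the samples never mix
```

If this is right, the batch dimension of 1 is just a packing trick to handle variable-length text conditions without padding, and attn_bias is what keeps samples from seeing each other. Is that the intended behavior?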