Open-Sora
Why does cross-attention view the batch dimension as 1?
Here's the MultiHeadCrossAttention code:
```python
# x: (B, N, C) visual tokens; cond: concatenated text tokens; mask: per-sample text lengths
B, N, C = x.shape

# Fold the batch into the sequence dimension: q becomes (1, B*N, num_heads, head_dim)
q = self.q_linear(x).view(1, -1, self.num_heads, self.head_dim)
kv = self.kv_linear(cond).view(1, -1, 2, self.num_heads, self.head_dim)
k, v = kv.unbind(2)

attn_bias = None
if mask is not None:
    # Block-diagonal mask: query block i (length N) may only attend to key block i (length mask[i])
    attn_bias = xformers.ops.fmha.BlockDiagonalMask.from_seqlens([N] * B, mask)
x = xformers.ops.memory_efficient_attention(q, k, v, p=self.attn_drop.p, attn_bias=attn_bias)

x = x.view(B, -1, C)
```
If the batch dimensions of q, k, and v are all folded into the second (sequence) dimension, with the first dimension fixed at 1, won't this mix up information from different videos and images within a batch? I see that the mask passed to attn_bias carries the batch-boundary information. Is it the attn_bias parameter that keeps the computation for different samples separate during memory_efficient_attention? If so, how does it guarantee that their information doesn't get mixed?
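My current reading is that BlockDiagonalMask simply forbids any query token from attending to a key token that belongs to a different sample, so folding the batch into the sequence dimension would be equivalent to running cross-attention per sample. Below is a minimal pure-PyTorch sketch of that hypothesis (it does not use the xformers kernel; all shapes, names, and values are made up for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, D = 3, 4, 8                        # batch size, visual tokens per sample, head dim
kv_lens = [5, 2, 7]                      # text tokens per sample (variable lengths)

q = torch.randn(B, N, D)                         # per-sample queries
kvs = [torch.randn(L, D) for L in kv_lens]       # per-sample keys/values (shared here)

# Reference: run cross-attention one sample at a time.
ref = torch.stack([
    F.scaled_dot_product_attention(q[i][None], kvs[i][None], kvs[i][None])[0]
    for i in range(B)
])

# Packed version: batch folded into the sequence dimension (leading dim = 1),
# plus a block-diagonal bias so query block i can only see key block i.
q_flat = q.reshape(1, B * N, D)
kv_flat = torch.cat(kvs, dim=0)[None]            # (1, sum(kv_lens), D)

bias = torch.full((B * N, kv_flat.shape[1]), float("-inf"))
start = 0
for i, L in enumerate(kv_lens):
    bias[i * N:(i + 1) * N, start:start + L] = 0.0   # allow only the matching block
    start += L

out = F.scaled_dot_product_attention(q_flat, kv_flat, kv_flat, attn_mask=bias)
out = out.reshape(B, N, D)

print(torch.allclose(out, ref, atol=1e-5))       # True: the samples never mix
```

If this is right, the batch dimension of 1 is just a packing trick to handle variable-length text conditions without padding, and attn_bias is what keeps samples from seeing each other. Is that the intended behavior?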