flash-attention

How to fix this "RuntimeError: cu_seqlens_q must have shape (batch_size + 1)"

Open · rikeilong opened this issue 1 year ago • 13 comments

[screenshot of the error]

rikeilong avatar Dec 28 '23 11:12 rikeilong

Can you check if cu_seqlens_q has shape (batch_size + 1)?

tridao avatar Dec 28 '23 18:12 tridao
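For reference, a minimal sketch (not from this thread; the lengths below are made up) of how cu_seqlens_q is typically built from per-sequence lengths, and why it ends up with batch_size + 1 entries:

import torch
import torch.nn.functional as F

# Hypothetical batch of three sequences with lengths 5, 3, and 7 (batch_size = 3).
seqlens = torch.tensor([5, 3, 7], dtype=torch.int32)

# cu_seqlens_q is the cumulative sum of the sequence lengths with a leading 0,
# so it has batch_size + 1 entries marking where each sequence starts and ends.
cu_seqlens_q = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))

print(cu_seqlens_q)                                   # tensor([ 0,  5,  8, 15], dtype=torch.int32)
assert cu_seqlens_q.shape[0] == seqlens.shape[0] + 1  # batch_size + 1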

[screenshot] I wonder if the error comes from this variable?

rikeilong avatar Dec 29 '23 02:12 rikeilong

Can you add a short script to reproduce the error?

tridao avatar Dec 29 '23 02:12 tridao

[screenshot] I tried to copy the above code here, but it causes an error.

rikeilong avatar Dec 29 '23 03:12 rikeilong

I'm wondering whether manually padding it to batch_size + 1 is the right solution. It ran successfully before this error appeared.

rikeilong avatar Dec 29 '23 03:12 rikeilong

[screenshot] I don't know what is going on. Why does this error occur? Is it because my data is wrong?

rikeilong avatar Dec 29 '23 04:12 rikeilong

I solved this problem by downgrading transformers from 4.35 to 4.33; it works now.

rikeilong avatar Dec 29 '23 05:12 rikeilong

I have attempted to utilize the BartFlashAttention class from the Transformers library with BioGPT, because BioGPT's attention mechanism is derived from the BART model's attention module. However, I encountered the following error: RuntimeError: cu_seqlens_q must have shape (batch_size + 1)

ahorazahedi avatar Feb 14 '24 21:02 ahorazahedi

> I have attempted to utilize the BartFlashAttention class from the Transformers library with BioGPT, because BioGPT's attention mechanism is derived from the BART model's attention module. However, I encountered the following error: RuntimeError: cu_seqlens_q must have shape (batch_size + 1)

Idk how transformers or BioGPT do things, you might get better help if you ask there.

tridao avatar Feb 14 '24 21:02 tridao

Thank you for your response. I've identified the problem: Transformers creates a causal-style attention mask for BioGPT. However, your FlashAttention expects the attention mask in the shape [bs, seq_len] plus a 'causal' parameter. So I would have to pass an attention mask similar to what a tokenizer generates to FlashAttention. Am I correct about the required shape for the attention mask?

ahorazahedi avatar Feb 14 '24 22:02 ahorazahedi

We don't take attention mask currently. You can see the README or function docstring for the required shapes.

tridao avatar Feb 14 '24 23:02 tridao
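For anyone hitting the same confusion, here is a rough sketch of the two call paths the docstrings describe (sizes and sequence lengths below are illustrative assumptions, not taken from this thread): flash_attn_func takes padded (batch, seqlen, nheads, headdim) tensors and a causal flag but no attention mask, while flash_attn_varlen_func takes packed tokens plus cu_seqlens of shape (batch_size + 1,).

import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

batch, seqlen, nheads, headdim = 2, 128, 8, 64  # illustrative sizes
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Dense path: no attention-mask argument; causal masking is requested via the flag.
out = flash_attn_func(q, k, v, causal=True)

# Varlen path: tokens from all sequences are packed along dim 0, and
# cu_seqlens_* (int32, shape (batch_size + 1,)) marks the sequence boundaries.
seqlens = torch.tensor([100, 128], dtype=torch.int32, device="cuda")
cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
total = int(seqlens.sum())
q_p = torch.randn(total, nheads, headdim, dtype=torch.float16, device="cuda")
k_p, v_p = torch.randn_like(q_p), torch.randn_like(q_p)
out_varlen = flash_attn_varlen_func(
    q_p, k_p, v_p,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seqlens.max()), max_seqlen_k=int(seqlens.max()),
    causal=True,
)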

Was there any final fix for this issue, or was downgrading transformers to a lower version the only option?

amitagh avatar Apr 02 '24 10:04 amitagh

import torch
import torch.nn.functional as F

def _get_unpad_data(attention_mask):
    # Per-sequence token counts; .flatten() guards against masks with extra dimensions.
    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32).flatten()
    # Indices of the non-padding tokens in the flattened batch.
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    # Cumulative sequence lengths with a leading 0 -> shape (batch_size + 1,).
    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )

Flattening seqlens_in_batch fixed this bug when using flash_attn_varlen_func, but I found it is no faster than using flash_attn_func with attention_mask=None.

rin2401 avatar Apr 10 '24 10:04 rin2401
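A possible explanation for the lack of speedup (my guess, assuming the _get_unpad_data version above with its imports in scope): the varlen path only pays off when the batch actually contains padding; with a full mask the unpad/pad round trip just adds gather/scatter overhead on top of the same attention work. A quick way to check:

import torch

attention_mask = torch.ones(4, 512, dtype=torch.int32)  # hypothetical mask with no padding
indices, cu_seqlens, max_seqlen = _get_unpad_data(attention_mask)

# cu_seqlens degenerates to evenly spaced offsets and max_seqlen equals the full
# sequence length, i.e. no tokens were removed, so no speedup should be expected.
print(cu_seqlens)   # tensor([   0,  512, 1024, 1536, 2048], dtype=torch.int32)
print(max_seqlen)   # 512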