
LongLoRA + Flash Attention 2 causing illegal memory access

Open · ArturNiederfahrenhorst opened this issue 1 year ago · 7 comments

Thanks for providing the LongLoRA forward functions. Your flash-attn and non-flash-attn implementations of S²-Attn show divergent behaviour in my case.

For a repro script, please have a look at the issue I opened over at the flash-attention repo: https://github.com/Dao-AILab/flash-attention/issues/670

The one without flash attention works without problems for me. I stepped through it, and the ops and shapes make sense to me. There, the shift is implemented by rolling (see the sketch below).
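For reference, here is a minimal sketch of how I understand the roll-based shift (not the repo's exact code; the layout and group size are just placeholders):

```python
import torch

def shift_half_heads(qkv: torch.Tensor, group_size: int) -> torch.Tensor:
    # qkv: (batch, seq_len, num_heads, head_dim)
    # The second half of the heads attends to groups shifted by half the group
    # size, which is implemented by rolling those heads along the sequence dim.
    bsz, seqlen, num_heads, head_dim = qkv.shape
    shifted = qkv.clone()
    shifted[:, :, num_heads // 2:] = qkv[:, :, num_heads // 2:].roll(
        -group_size // 2, dims=1
    )
    return shifted
```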

The one with flash attention shows weird behaviour. There, the shift is not just a roll; instead, cu_q_lens is manipulated. To me, the code looks as if it was written with token sequences longer than half the group size in mind, or something like that. For a batch with a 4k context length but only 8 unpadded tokens, I end up with cu_q_lens = [0, 8, 520, 16]. For smaller group sizes, the 520 in this tensor "shrinks" (see the sketch after this paragraph for why that value looks suspicious to me).
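If I understand the flash-attn varlen interface correctly, cu_seqlens is the cumulative sum of the per-sequence token counts in the packed batch, so it must be non-decreasing and end at the total number of unpadded tokens. A rough sketch of what I would expect for my case (plain PyTorch, no flash-attn call, just illustrating the assumption):

```python
import torch

# Two unpadded "sequences" of 8 tokens each after the split/shift.
seq_lens = torch.tensor([8, 8], dtype=torch.int32)

# Expected cumulative lengths: [0, 8, 16].
cu_seqlens = torch.cat(
    [torch.zeros(1, dtype=torch.int32), seq_lens.cumsum(0).to(torch.int32)]
)

# What I actually observe: [0, 8, 520, 16]. This is not non-decreasing, and 520
# points far past the 16 packed tokens, which would explain reads out of bounds
# and hence the illegal memory access.
```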

Could you please elaborate on these calculations or help me fix this?

ArturNiederfahrenhorst · Nov 21 '23