BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
The experimental results across different sequence lengths demonstrate that BurstAttention offers significant advantages for processing long sequences compared with competitive baselines, especially tensor parallelism (Megatron-V3) with FlashAttention, reducing communication overhead by 40% and achieving a 2× speedup when training on 128K-token sequences with 8×A100 GPUs.
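
For context, my rough understanding from the paper is that BurstAttention shards the sequence across devices and circulates K/V blocks in a ring while merging partial attention results with an online (log-sum-exp) softmax, similar in spirit to ring attention combined with FlashAttention-style local optimizations. Below is a toy single-process sketch of just that merging step; the shard lists and the function name are made up for illustration and are not any real BurstAttention or unsloth API:

```python
import torch

def ring_attention_sketch(q_shards, k_shards, v_shards, scale):
    """Toy single-process simulation of ring-style distributed attention.

    q_shards / k_shards / v_shards: lists of per-"device" tensors, each of
    shape [seq_chunk, dim]. Each device keeps its Q shard fixed and sees
    every K/V shard in turn, merging partial results with an online
    softmax so the full attention matrix is never materialized.
    """
    world = len(q_shards)
    outputs = []
    for rank in range(world):
        q = q_shards[rank]
        # running max, softmax normalizer, and weighted-value accumulator
        m = torch.full((q.shape[0], 1), float("-inf"))
        l = torch.zeros(q.shape[0], 1)
        acc = torch.zeros_like(q)
        for step in range(world):
            # in a real distributed run, this K/V block would arrive via
            # P2P communication from the neighboring device in the ring
            src = (rank + step) % world
            k, v = k_shards[src], v_shards[src]
            s = (q @ k.T) * scale                               # local scores
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)                            # stabilized probs
            correction = torch.exp(m - m_new)                   # rescale old state
            l = l * correction + p.sum(dim=-1, keepdim=True)
            acc = acc * correction + p @ v
            m = m_new
        outputs.append(acc / l)                                 # final normalization
    return outputs
```

Since each device only ever holds one Q shard plus one in-flight K/V block, the memory per device stays proportional to the shard size rather than the full sequence, which (if I read the paper right) is where the communication and memory savings over tensor parallelism come from.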
https://arxiv.org/abs/2403.09347

I'm not sure whether this would be useful for unsloth. I'd like to know what you all think.
Very interesting!