BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
The experimental results across different sequence lengths demonstrate that BurstAttention offers significant advantages for processing long sequences compared with competitive baselines, especially tensor parallelism (Megatron-V3) with FlashAttention, reducing communication overhead by 40% and achieving a 2× speedup when training on 128K-token sequences with 8×A100 GPUs.
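
For context, my rough understanding from the paper is that BurstAttention shards the sequence across devices and circulates K/V blocks in a ring while merging partial attention results with an online (log-sum-exp) softmax, similar in spirit to ring attention combined with FlashAttention-style local optimizations. Below is a toy single-process sketch of just that merging step; the shard lists and the function name are made up for illustration and are not any real BurstAttention or unsloth API:

```python
import torch

def ring_attention_sketch(q_shards, k_shards, v_shards, scale):
    """Toy single-process simulation of ring-style distributed attention.

    q_shards / k_shards / v_shards: lists of per-"device" tensors, each of
    shape [seq_chunk, dim]. Each device keeps its Q shard fixed and sees
    every K/V shard in turn, merging partial results with an online
    softmax so the full attention matrix is never materialized.
    """
    world = len(q_shards)
    outputs = []
    for rank in range(world):
        q = q_shards[rank]
        # running max, softmax normalizer, and weighted-value accumulator
        m = torch.full((q.shape[0], 1), float("-inf"))
        l = torch.zeros(q.shape[0], 1)
        acc = torch.zeros_like(q)
        for step in range(world):
            # in a real distributed run, this K/V block would arrive via
            # P2P communication from the neighboring device in the ring
            src = (rank + step) % world
            k, v = k_shards[src], v_shards[src]
            s = (q @ k.T) * scale                               # local scores
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)                            # stabilized probs
            correction = torch.exp(m - m_new)                   # rescale old state
            l = l * correction + p.sum(dim=-1, keepdim=True)
            acc = acc * correction + p @ v
            m = m_new
        outputs.append(acc / l)                                 # final normalization
    return outputs
```

Since each device only ever holds one Q shard plus one in-flight K/V block, the memory per device stays proportional to the shard size rather than the full sequence, which (if I read the paper right) is where the communication and memory savings over tensor parallelism come from.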
https://arxiv.org/abs/2403.09347

I'm not sure whether this would be useful for unsloth. I'd like to know what you all think.
Very interesting!