DeepSpeed
[Q&A] Why can DeepSpeed Ulysses support long sequence lengths?
Sorry to post the question here.
According to the paper, after all_to_all, every device holds 1/P of the heads, which are then used for local attention computation.
My doubt is that this means the Query still has shape [N, d/P], which will still incur $O(N^2)$ memory consumption in naive attention computation. That is still huge.
Why does this work?
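To make the concern above concrete, here is a back-of-the-envelope sketch with made-up numbers (N and dtype are hypothetical, not from the paper): even though each rank only holds h/P heads after all_to_all, a naively materialized attention score matrix is still N x N per head.

```python
# Hypothetical illustration of naive attention memory at long context.
N = 1_000_000            # assumed full sequence length seen by each rank after all_to_all
bytes_per_elem = 2       # fp16/bf16 activations
scores_per_head = N * N * bytes_per_elem   # one N x N score matrix, one head
print(f"{scores_per_head / 1e12:.0f} TB per head")   # ~2 TB per head: clearly infeasible naively
```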
@samadejacobs
In my opinion, Ulysses is another form of TP.
@Momo-Tori, yes, Ulysses is a form of TP in the sense that the attention block is head-parallel. In general, Ulysses is sequence parallelism + head parallelism: it starts out sequence-parallel, transforms to head-parallel with an all2all operation, and then goes back to sequence-parallel after the attention block. @foreverlms, yes, dense attention computation is $O(N^2)$; we leverage/integrate system optimizations such as FlashAttention (1 and 2) and ZeRO, along with algorithmic innovations in sparse attention, to scale to extreme long-context (1M+) LLMs.
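For readers following along, here is a minimal sketch (not DeepSpeed's actual implementation) of the two all2all redistributions described above, assuming `torch.distributed` is already initialized and that the world size P divides both the sequence length N and the number of heads h:

```python
import torch
import torch.distributed as dist

def seq_to_head(x: torch.Tensor) -> torch.Tensor:
    """Sequence-parallel -> head-parallel: [N/P, h, d_h] per rank -> [N, h/P, d_h] per rank."""
    P = dist.get_world_size()
    n_local, h, d_h = x.shape                                   # n_local = N / P
    # Split the heads into P groups so each rank can receive one group for the whole sequence.
    x = x.reshape(n_local, P, h // P, d_h).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x)                              # exchange head groups across ranks
    # Received sequence shards are stacked along dim 0 in rank order; concatenate them.
    return out.reshape(P * n_local, h // P, d_h)                # [N, h/P, d_h]

def head_to_seq(x: torch.Tensor) -> torch.Tensor:
    """Head-parallel -> sequence-parallel (inverse): [N, h/P, d_h] per rank -> [N/P, h, d_h] per rank."""
    P = dist.get_world_size()
    n, h_local, d_h = x.shape                                   # n = N, h_local = h / P
    x = x.reshape(P, n // P, h_local, d_h).contiguous()         # dim 0 indexes sequence shards
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x)                              # send each sequence shard back to its rank
    # Received head groups are stacked along dim 0; fold them back into the head dimension.
    return out.permute(1, 0, 2, 3).reshape(n // P, P * h_local, d_h)   # [N/P, h, d_h]
```

Between `seq_to_head` and `head_to_seq`, each rank runs ordinary attention over the full sequence for its h/P heads; using FlashAttention there avoids materializing the N x N score matrix, which is what keeps memory manageable despite the quadratic compute.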