DeepSpeed
[Q&A] Why can DeepSpeed Ulysses support long sequence lengths?
Sorry to post the question here.
According to the paper, after all_to_all, every device holds 1/P of the heads, which are then used for local attention computation.
My doubt is that this means the Query still has shape [N, d/P], which will still incur $O(N^2)$ memory consumption in naive attention computation. That is still huge.
Why does this work?
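To make the concern above concrete, here is a back-of-the-envelope sketch with made-up numbers (N and dtype are hypothetical, not from the paper): even though each rank only holds h/P heads after all_to_all, a naively materialized attention score matrix is still N x N per head.

```python
# Hypothetical illustration of naive attention memory at long context.
N = 1_000_000            # assumed full sequence length seen by each rank after all_to_all
bytes_per_elem = 2       # fp16/bf16 activations
scores_per_head = N * N * bytes_per_elem   # one N x N score matrix, one head
print(f"{scores_per_head / 1e12:.0f} TB per head")   # ~2 TB per head: clearly infeasible naively
```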
@samadejacobs
In my opinion, Ulysses is another form of TP.
@Momo-Tori, yes, Ulysses is a form of TP in the sense that the attention block is head-parallel. In general, Ulysses is sequence parallelism + head parallelism: it starts out sequence-parallel, transforms to head-parallel with an all2all operation, and then goes back to sequence-parallel after the attention block. @foreverlms, yes, dense attention computation is $O(N^2)$; we leverage/integrate system optimizations such as FlashAttention (1 and 2) and ZeRO, along with algorithmic innovations in sparse attention, to scale to extreme long-context (1M+) LLMs.
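For readers following along, here is a minimal sketch (not DeepSpeed's actual implementation) of the two all2all redistributions described above, assuming `torch.distributed` is already initialized and that the world size P divides both the sequence length N and the number of heads h:

```python
import torch
import torch.distributed as dist

def seq_to_head(x: torch.Tensor) -> torch.Tensor:
    """Sequence-parallel -> head-parallel: [N/P, h, d_h] per rank -> [N, h/P, d_h] per rank."""
    P = dist.get_world_size()
    n_local, h, d_h = x.shape                                   # n_local = N / P
    # Split the heads into P groups so each rank can receive one group for the whole sequence.
    x = x.reshape(n_local, P, h // P, d_h).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x)                              # exchange head groups across ranks
    # Received sequence shards are stacked along dim 0 in rank order; concatenate them.
    return out.reshape(P * n_local, h // P, d_h)                # [N, h/P, d_h]

def head_to_seq(x: torch.Tensor) -> torch.Tensor:
    """Head-parallel -> sequence-parallel (inverse): [N, h/P, d_h] per rank -> [N/P, h, d_h] per rank."""
    P = dist.get_world_size()
    n, h_local, d_h = x.shape                                   # n = N, h_local = h / P
    x = x.reshape(P, n // P, h_local, d_h).contiguous()         # dim 0 indexes sequence shards
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x)                              # send each sequence shard back to its rank
    # Received head groups are stacked along dim 0; fold them back into the head dimension.
    return out.permute(1, 0, 2, 3).reshape(n // P, P * h_local, d_h)   # [N/P, h, d_h]
```

Between `seq_to_head` and `head_to_seq`, each rank runs ordinary attention over the full sequence for its h/P heads; using FlashAttention there avoids materializing the N x N score matrix, which is what keeps memory manageable despite the quadratic compute.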