DeepSpeed
DeepSpeed Ulysses
Ring Attention should work with DeepSpeed Ulysses, correct? Are there any notable issues with combining DeepSpeed's efficient sequence parallelism with such an attention mechanism? I understand that FlashAttention already works.
https://github.com/zhuzilin/ring-flash-attention
Ulysses is, in principle, attention-type agnostic. Although we haven’t specifically tested Ulysses with Ring Attention, as long as the qkv can be split or sharded along sequence and head dimensions, it should work. Contributions are welcome!
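To make the "split or sharded along sequence and head dimensions" condition concrete, here is a minimal single-process sketch of the layout change involved. It is not DeepSpeed's API: the function names are hypothetical, the "ranks" are plain Python lists, and each tensor element is replaced by a `(token, head)` tag so the data movement is easy to inspect. It assumes both the sequence length and the head count divide evenly by the number of ranks.

```python
# Toy simulation of a sequence-sharded <-> head-sharded exchange.
# All names here are hypothetical; no real communication takes place.

def shard_by_sequence(x, world_size):
    """Split a [seq][heads] nested list into contiguous sequence shards."""
    seq_len = len(x)
    assert seq_len % world_size == 0, "sequence must divide evenly"
    chunk = seq_len // world_size
    return [x[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def seq_to_head_all_to_all(seq_shards):
    """Each rank ends up with the full sequence for its own slice of heads."""
    world_size = len(seq_shards)
    num_heads = len(seq_shards[0][0])
    assert num_heads % world_size == 0, "heads must divide evenly"
    hp = num_heads // world_size  # heads per rank
    out = []
    for r in range(world_size):
        rows = []
        for s in range(world_size):       # visit source ranks in sequence order
            for token in seq_shards[s]:
                rows.append(token[r * hp:(r + 1) * hp])
        out.append(rows)
    return out

def head_to_seq_all_to_all(head_shards):
    """Inverse exchange: back to sequence shards that hold every head."""
    world_size = len(head_shards)
    seq_len = len(head_shards[0])
    chunk = seq_len // world_size
    out = []
    for r in range(world_size):
        rows = []
        for t in range(r * chunk, (r + 1) * chunk):
            row = []
            for s in range(world_size):   # re-join head slices in head order
                row.extend(head_shards[s][t])
            rows.append(row)
        out.append(rows)
    return out

seq_len, num_heads, world_size = 8, 4, 2
q = [[(t, h) for h in range(num_heads)] for t in range(seq_len)]

seq_shards = shard_by_sequence(q, world_size)       # per rank: 4 tokens x 4 heads
head_shards = seq_to_head_all_to_all(seq_shards)    # per rank: 8 tokens x 2 heads
restored = head_to_seq_all_to_all(head_shards)      # per rank: 4 tokens x 4 heads
```

Between the two exchanges each rank sees the whole sequence for a subset of heads, which is where an attention kernel would run in this picture. Whether a ring-style kernel, which expects its own sequence sharding, composes cleanly at that point is exactly what remains untested in this thread.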
Hi @samadejacobs,
I appreciate the insight.
I will test the two together and let you know.
Thank you,
Enrico