DeepSpeed sequence parallelism (aka Ulysses) integration with HF transformers
What does this PR do?
This PR enhances the capabilities of DeepSpeed long sequence (context) parallelism (aka DS Ulysses) with support for HF models. Support is currently enabled when both DeepSpeed and FlashAttention are enabled; support will be extended to SDPA in the future. All current and future HF models (such as Llama, OPT, etc.) using the refactored flash_attention_utils are supported.
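For context, the core of Ulysses is an all-to-all that swaps the sequence shard for a head shard around the local attention call, so each rank attends over the full sequence but only a slice of the heads. Below is a minimal, hypothetical sketch of that layout swap (not the code in this PR), assuming tensors shaped [batch, local_seq, heads, head_dim], an already-initialised sequence-parallel process group sp_group, and head/sequence counts divisible by the group size:

```python
import torch
import torch.distributed as dist


def seq_to_head_parallel(x: torch.Tensor, sp_group) -> torch.Tensor:
    """[b, s/P, h, d] -> [b, s, h/P, d]: gather the sequence, scatter the heads."""
    p = dist.get_world_size(group=sp_group)
    # Send one head-chunk to every rank in the sequence-parallel group.
    send = [c.contiguous() for c in x.chunk(p, dim=2)]
    recv = [torch.empty_like(send[0]) for _ in range(p)]
    dist.all_to_all(recv, send, group=sp_group)
    # Each received chunk is a different sequence shard of "our" head-chunk.
    return torch.cat(recv, dim=1)


def head_to_seq_parallel(x: torch.Tensor, sp_group) -> torch.Tensor:
    """[b, s, h/P, d] -> [b, s/P, h, d]: the inverse swap after attention."""
    p = dist.get_world_size(group=sp_group)
    send = [c.contiguous() for c in x.chunk(p, dim=1)]
    recv = [torch.empty_like(send[0]) for _ in range(p)]
    dist.all_to_all(recv, send, group=sp_group)
    return torch.cat(recv, dim=2)
```

The unmodified flash-attention kernel then runs on the [b, s, h/P, d] tensors, and the inverse swap restores the sequence shard for the rest of the model.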
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@muellerzr
cc @SunMarc
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@samadejacobs I'm glad to see this PR will be merged soon. When are you going to support SDPA in the future? It would be useful for me.
Yes, but I want to run it on an NPU, which doesn't support FlashAttention-2, only SDPA.
@samadejacobs anything I can do to help get this merged?
> @samadejacobs I'm glad to see this PR will be merged soon. When are you going to support SDPA in the future? It would be useful for me.
@glowwormX, support will be extended to SDPA in the future.
> @samadejacobs anything I can do to help get this merged?
@ArthurZucker, many thanks, please see my earlier response.
Hey @samadejacobs! Ah, I am not sure I have the bandwidth right now for https://github.com/huggingface/transformers/pull/32305#discussion_r1769601049, but if you can't do it, I'll see if I can ping someone or do it myself!
Hi, do I understand correctly that this PR deals only with attention support? As far as I understand, sequence parallelism requires that two or more workers are given the same data, and this requires adjustments in the Trainer and the dataloaders.
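Concretely, one way to satisfy that requirement is to build the sampler with the data-parallel rank instead of the global rank, so that all members of a sequence-parallel group draw the same batch. A rough sketch, assuming contiguous sequence-parallel groups and a hypothetical sp_size configuration value:

```python
import torch.distributed as dist
from torch.utils.data import DistributedSampler


def build_sp_aware_sampler(dataset, sp_size: int) -> DistributedSampler:
    world_size = dist.get_world_size()
    global_rank = dist.get_rank()
    dp_world_size = world_size // sp_size
    # Every rank in one sequence-parallel group maps to the same data-parallel rank,
    # so they all receive identical samples.
    dp_rank = global_rank // sp_size
    return DistributedSampler(dataset, num_replicas=dp_world_size, rank=dp_rank)
```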
@samadejacobs so how is this going now?
I have been working on this for my team at Bloomberg. I think I may have a PR that merges everybody's changes and works. Will share shortly, and hopefully we can get this feature in.
cc @XuehaiPan
Hi, I'd like to know about the plan to merge it, and I would be happy to help with it.
I have recently implemented Ulysses attention in our turbo-alignment library, which is built on the Transformers library and heavily uses both Transformers models and the Trainer. Despite not being merged, that PR is quite well tested, and we have successfully trained several models with large context with this code.
From this experience I have learned that, to achieve this, we need to take several steps:
- Add or patch the existing attention modules. As in this PR, this can be done with a decorator. The problem is the eager attention module, which is duplicated across many models (Gemma and Llama, for example). It is questionable whether someone who wants context parallelism would use the eager implementation, so IMHO eager attention can be left for later.
- Adjust all computations that use the world size: for example, the number of training steps in the Trainer code, and some places in Accelerate such as dataloader preparation. This is important for correct dataset sharding across workers.
- Perfectly synchronize all the random generators inside one sequence-parallel group, since otherwise the workers can end up with different permutations (see the seed-broadcast sketch after this list).
- Deal with position numbering and KV caches. As of today, in the absence of a cache, position_ids is assigned the numbers from 0 to seq_length - 1, and this has to be adjusted so that the i-th worker in the sequence-parallel group sees only its subset of tokens (see the Gemma 2 code for an example, and the position_ids sketch after this list). This is not hard to do, but I don't know how to deal with caches; in our implementation we explicitly disable them.
- Attention masks. It seems right that each worker inside a sequence-parallel group sees the whole attention mask (a worker needs it to compute the attention), so some code around computing lengths and cache positions has to be adjusted.
- Generation. The generation process requires regrouping the whole sequence at each step, and at each step we also have to deal with the KV caches.
- Worker discovery. As far as I know, there are two approaches: a model-parallel unit and a torch device mesh (see the device-mesh sketch after this list).
- Loss computation. We have taken the DeepSpeed approach to computing the cross-entropy (see the sketch after this list), but all other losses, such as the one for sequence classification, should be adjusted as well.
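For the RNG point above, a minimal sketch of the kind of synchronization I mean, assuming a recent PyTorch and an already-created sequence-parallel group sp_group:

```python
import torch
import torch.distributed as dist


def sync_rng_within_sp_group(sp_group) -> None:
    # The first rank of the sequence-parallel group picks a seed; everyone adopts it.
    seed = torch.randint(0, 2**31 - 1, (1,), dtype=torch.int64)
    if dist.get_backend(sp_group) == "nccl":
        seed = seed.cuda()
    src = dist.get_global_rank(sp_group, 0)
    dist.broadcast(seed, src=src, group=sp_group)
    torch.manual_seed(int(seed.item()))
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(int(seed.item()))
```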
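For the position-numbering point, the adjustment amounts to offsetting the local range by the worker's shard; a hypothetical helper:

```python
import torch
import torch.distributed as dist


def local_position_ids(local_seq_len: int, sp_group, device=None) -> torch.Tensor:
    # Rank r of the sequence-parallel group holds tokens [r * local_seq_len, (r + 1) * local_seq_len).
    rank = dist.get_rank(group=sp_group)
    start = rank * local_seq_len
    return torch.arange(start, start + local_seq_len, device=device).unsqueeze(0)
```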
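For worker discovery, the device-mesh route could look like this (dimension names and sp_size are illustrative; requires a recent PyTorch, 2.2+):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

sp_size = 4                                    # assumed sequence-parallel degree
world_size = dist.get_world_size()
mesh = init_device_mesh(
    "cuda",
    (world_size // sp_size, sp_size),
    mesh_dim_names=("dp", "sp"),
)
sp_group = mesh.get_group("sp")                # ranks that share one long sequence
dp_group = mesh.get_group("dp")                # ranks that see different data
```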
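And for the loss, the idea is that each worker computes the cross-entropy over its local token shard and the per-token sums are all-reduced over the group. A sketch (not DeepSpeed's actual kernel), assuming labels are already shifted and sharded the same way as the logits:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def sp_cross_entropy(logits, labels, sp_group, ignore_index=-100):
    # Local sum of per-token losses and local count of non-ignored tokens.
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="sum",
    )
    n_tokens = (labels != ignore_index).sum().to(loss_sum.dtype)
    stats = torch.stack([loss_sum, n_tokens])
    # Sum over the sequence-parallel group so every rank sees the full-sequence loss.
    dist.all_reduce(stats, group=sp_group)
    return stats[0] / stats[1].clamp(min=1)
```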
I would love to discuss these issues and help deal with them.
I am also aware of another context-parallel approach, but it seems like this great library can have both, right?
We will soon have a new PR from the DeepSpeed team! I will close this one once they open it. Sorry that this one didn't get merged.
Hi @SunMarc, is there any update on this topic? Any expected timeline? Thanks :)
cc @stas00 do you know when you are planning to land this?
@SunMarc, please kindly sync with @S1ro1 - we are waiting for him to complete the redesign of the parallelism in HF Accelerate. https://github.com/huggingface/accelerate/pull/3673
@stas00 I think, given my limited availability recently and the time before I'll be able to get to it in DeepSpeed, you can probably just integrate it with DeepSpeed as-is and we'll do the refactor separately; otherwise I don't think it's entirely realistic to get it merged soon enough.
> you can probably just integrate it with DeepSpeed
I'm not sure what you mean, Matej. It's already in DeepSpeed. Unless you mean in the DeepSpeed plugin of HF Accelerate?
What's the rough new ETA for your work?