
[RFC]: Integrate USP (Ulysses + Ring Attention) for Context Parallelism in SpecForge

Open uygnef opened this issue 1 month ago • 3 comments

1. Motivation

Training 16k-length sequences currently causes OOM errors (see https://github.com/sgl-project/SpecForge/issues/112). To support 100k+ sequences, we need efficient context parallelism (CP). Per https://arxiv.org/abs/2405.07719, USP (Ulysses + Ring Attention) outperforms standalone approaches, making it our top choice.

2. Proposal

Integrate USP into SpecForge. This hybrid approach combines:

- Ulysses: offers better performance
- Ring Attention: enables support for longer sequence lengths
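
For reference, a minimal sketch of what a 2D USP process layout generally looks like, not SpecForge's actual implementation: a Ulysses group (all-to-all over attention heads) nested inside a Ring group (block-wise K/V exchange). The helper name build_usp_groups is hypothetical; it uses PyTorch's device-mesh API and assumes pure context parallelism (world_size == ring_size * ulysses_size), launched under torchrun.

# Hypothetical sketch, not SpecForge code.
from torch.distributed.device_mesh import init_device_mesh

def build_usp_groups(ulysses_size: int, ring_size: int):
    # Outer dim = Ring Attention peers, inner dim = Ulysses peers, so each
    # Ulysses all-to-all stays on adjacent (typically intra-node) ranks.
    mesh = init_device_mesh(
        "cuda",
        (ring_size, ulysses_size),
        mesh_dim_names=("ring", "ulysses"),
    )
    ulysses_group = mesh.get_group("ulysses")  # swaps sequence sharding for head sharding
    ring_group = mesh.get_group("ring")        # passes K/V blocks around the ring
    return ulysses_group, ring_group

# Example: --sp-ulysses-size 2 --sp-ring-size 2 on 4 GPUs
# ulysses_group, ring_group = build_usp_groups(2, 2)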

3. Expected Benefits

- Enable 100k+ sequence training without OOM
- Maintain computational efficiency
- Preserve model accuracy at scale

uygnef avatar Nov 13 '25 14:11 uygnef

Any progress on this feature?

FrankLeeeee avatar Nov 21 '25 06:11 FrankLeeeee

In progress. Expected to be completed around next Wednesday. I'll finish the SDPA version first. The official nightly build of FlexAttention is already available, and I'll integrate it later.
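
For context, here is a minimal, hedged sketch of the two backends in plain PyTorch (shapes and the torch.compile wrapping are illustrative assumptions, not SpecForge code):

import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# SDPA path: PyTorch dispatches to a fused kernel (flash / mem-efficient / math).
out_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# FlexAttention path: the same causal attention, with the mask given as a mask_mod.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out_flex = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)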

uygnef avatar Nov 21 '25 07:11 uygnef

Sure, awesome.

FrankLeeeee avatar Nov 21 '25 07:11 FrankLeeeee

Hello, I'd like to know how development is going. If there's any progress, we'd be happy to collaborate.

jiapingW avatar Dec 11 '25 02:12 jiapingW

https://github.com/sgl-project/SpecForge/pull/363 @jiapingW

Hi, the feature is mostly done. There was a drop in performance after the merge, but I've traced it back to an issue with the LR scheduler and will fix it later.

If you're in a hurry, feel free to try the PR above. It's stable for offline use, though the online implementation still lacks the hidden state TP-to-SP mapping.

uygnef avatar Dec 12 '25 12:12 uygnef

Great, I'll test its memory savings in the next couple of days.

jiapingW avatar Dec 12 '25 16:12 jiapingW

I tested with the following command on 4 × H20 GPUs.

CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
    --nproc_per_node 4 \
    --standalone \
    scripts/train_eagle3.py \
    --target-model-path Qwen/Qwen2.5-7B \
    --draft-model-config $ROOT_DIR/configs/qwen2.5-7b-eagle3.json \
    --train-data-path /disk3/wjp/datasets/1w.jsonl \
    --train-hidden-states-path $HIDDEN_STATE_PATH \
    --output-dir $ROOT_DIR/outputs/1w_usp \
    --num-epochs 4 \
    --tp-size 1 \
    --learning-rate 1e-4 \
    --attention-backend usp \
    --sp-ulysses-size 2 \
    --sp-ring-size 2 \
    --max-length 16384 \
    --chat-template qwen \
    --cache-dir $ROOT_DIR/cache \
    --report tensorboard

The result was that per-GPU memory usage was over 90 GB, whereas my single-GPU DP-only run used only 50 GB.

jiapingW avatar Dec 15 '25 09:12 jiapingW

FlexAttention isn't supported yet, so the SDPA path uses more memory. We'll add FlexAttention support later.

If you test with DP, you can check memory usage with --attention-backend sdpa.

For 16K length, we've tested and confirmed --sp_ulysses_size 2 --sp_ring_size 4 runs successfully.
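
If it helps when comparing configurations, a quick way to log per-rank peak memory with standard PyTorch calls (illustrative only, not a SpecForge feature):

import torch
import torch.distributed as dist

def log_peak_memory(tag: str):
    # Assumes torchrun and an initialized process group.
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[rank {dist.get_rank()}] {tag}: peak allocated {peak_gb:.1f} GB")

# Call torch.cuda.reset_peak_memory_stats() before the step you want to
# measure, then log_peak_memory("step") afterwards to compare backends.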

uygnef avatar Dec 15 '25 09:12 uygnef

Great. With --sp_ulysses_size 2 --sp_ring_size 4, it uses 58 GB of VRAM per GPU, which is less than the original SDPA run.

jiapingW avatar Dec 15 '25 11:12 jiapingW