
[RFC]: Integrate USP (Ulysses + Ring Attention) for Context Parallelism in SpecForge

Open uygnef opened this issue 1 month ago • 3 comments

1. Motivation

Training 16k-length sequences currently causes OOM errors (see https://github.com/sgl-project/SpecForge/issues/112). To support 100k+ sequences, we need efficient context parallelism (CP). Per https://arxiv.org/abs/2405.07719, USP (Ulysses + Ring Attention) outperforms standalone approaches, making it our top choice.

2. Proposal

Integrate USP into SpecForge. This hybrid approach combines:

- Ulysses: offers better performance
- Ring Attention: enables support for longer sequence lengths
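
For reference, a minimal sketch of what a 2D USP process layout generally looks like, not SpecForge's actual implementation: a Ulysses group (all-to-all over attention heads) nested inside a Ring group (block-wise K/V exchange). The helper name build_usp_groups is hypothetical; it uses PyTorch's device-mesh API and assumes pure context parallelism (world_size == ring_size * ulysses_size), launched under torchrun.

# Hypothetical sketch, not SpecForge code.
from torch.distributed.device_mesh import init_device_mesh

def build_usp_groups(ulysses_size: int, ring_size: int):
    # Outer dim = Ring Attention peers, inner dim = Ulysses peers, so each
    # Ulysses all-to-all stays on adjacent (typically intra-node) ranks.
    mesh = init_device_mesh(
        "cuda",
        (ring_size, ulysses_size),
        mesh_dim_names=("ring", "ulysses"),
    )
    ulysses_group = mesh.get_group("ulysses")  # swaps sequence sharding for head sharding
    ring_group = mesh.get_group("ring")        # passes K/V blocks around the ring
    return ulysses_group, ring_group

# Example: --sp-ulysses-size 2 --sp-ring-size 2 on 4 GPUs
# ulysses_group, ring_group = build_usp_groups(2, 2)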

3. Expected Benefits

- Enable 100k+ sequence training without OOM
- Maintain computational efficiency
- Preserve model accuracy at scale

uygnef avatar Nov 13 '25 14:11 uygnef

Any progress on this feature?

FrankLeeeee avatar Nov 21 '25 06:11 FrankLeeeee

In progress. Expected to be completed around next Wednesday. I'll finish the SDPA version first. The official nightly build of FlexAttention is already available, and I'll integrate it later.
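
For context, here is a minimal, hedged sketch of the two backends in plain PyTorch (shapes and the torch.compile wrapping are illustrative assumptions, not SpecForge code):

import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# SDPA path: PyTorch dispatches to a fused kernel (flash / mem-efficient / math).
out_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# FlexAttention path: the same causal attention, with the mask given as a mask_mod.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out_flex = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)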

uygnef avatar Nov 21 '25 07:11 uygnef

Sure, awesome.

FrankLeeeee avatar Nov 21 '25 07:11 FrankLeeeee

Hello, I'd like to know how development is going. If there's any progress, we'd be happy to collaborate.

jiapingW avatar Dec 11 '25 02:12 jiapingW

https://github.com/sgl-project/SpecForge/pull/363 @jiapingW

Hi, the feature is mostly done. There was a drop in performance after the merge, but I've traced it back to an issue with the LR scheduler and will fix it later.

If you're in a hurry, feel free to try the PR above. It's stable for offline use, though the online implementation still lacks the hidden state TP-to-SP mapping.

uygnef avatar Dec 12 '25 12:12 uygnef

Great, I'll test its memory savings in the next couple of days.

jiapingW avatar Dec 12 '25 16:12 jiapingW

I tested with the following command on 4 × H20 GPUs.

CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
    --nproc_per_node 4 \
    --standalone \
    scripts/train_eagle3.py \
    --target-model-path Qwen/Qwen2.5-7B \
    --draft-model-config $ROOT_DIR/configs/qwen2.5-7b-eagle3.json \
    --train-data-path /disk3/wjp/datasets/1w.jsonl \
    --train-hidden-states-path $HIDDEN_STATE_PATH \
    --output-dir $ROOT_DIR/outputs/1w_usp \
    --num-epochs 4 \
    --tp-size 1 \
    --learning-rate 1e-4 \
    --attention-backend usp \
    --sp-ulysses-size 2 \
    --sp-ring-size 2 \
    --max-length 16384 \
    --chat-template qwen \
    --cache-dir $ROOT_DIR/cache \
    --report tensorboard

The result was that per-GPU memory usage was over 90 GB, whereas my single-GPU DP-only run used only 50 GB.

jiapingW avatar Dec 15 '25 09:12 jiapingW

FlexAttention isn't supported yet, so the SDPA path uses more memory. We'll add FlexAttention support later.

If you test with DP, you can check memory usage with --attention-backend sdpa.

For 16K length, we've tested and confirmed --sp_ulysses_size 2 --sp_ring_size 4 runs successfully.
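
If it helps when comparing configurations, a quick way to log per-rank peak memory with standard PyTorch calls (illustrative only, not a SpecForge feature):

import torch
import torch.distributed as dist

def log_peak_memory(tag: str):
    # Assumes torchrun and an initialized process group.
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[rank {dist.get_rank()}] {tag}: peak allocated {peak_gb:.1f} GB")

# Call torch.cuda.reset_peak_memory_stats() before the step you want to
# measure, then log_peak_memory("step") afterwards to compare backends.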

uygnef avatar Dec 15 '25 09:12 uygnef

Great. With --sp_ulysses_size 2 --sp_ring_size 4, it uses 58 GB of VRAM per GPU, which is less than the original SDPA run.

jiapingW avatar Dec 15 '25 11:12 jiapingW