[RFC]: Integrate USP (Ulysses + Ring Attention) for Context Parallelism in SpecForge
1. Motivation
Training 16k-length sequences currently causes OOM errors (https://github.com/sgl-project/SpecForge/issues/112). To support 100k+ sequences, we need efficient context parallelism (CP). Per https://arxiv.org/abs/2405.07719, USP (Ulysses + Ring Attention) outperforms either approach on its own, making it our top choice.
2. Proposal
Integrate USP into SpecForge. This hybrid approach combines:
- Ulysses: offers better performance
- Ring Attention: enables support for longer sequence lengths
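A minimal sketch of how the two dimensions could be composed into per-rank process groups, assuming a plain torch.distributed setup (the function and variable names here are illustrative, not SpecForge's actual API):

```python
import torch.distributed as dist

def init_usp_groups(ulysses_size: int, ring_size: int):
    """Split the SP world into a 2D grid: Ulysses (head all-to-all) along one
    axis, Ring Attention (KV-block P2P) along the other. Sketch only."""
    world_size = dist.get_world_size()
    assert world_size == ulysses_size * ring_size, \
        "sp_ulysses_size * sp_ring_size must equal the SP world size"
    rank = dist.get_rank()

    ulysses_group, ring_group = None, None
    # Ranks sharing the same ring index form a Ulysses group, e.g. {0,1}, {2,3}.
    for r in range(ring_size):
        ranks = [r * ulysses_size + u for u in range(ulysses_size)]
        group = dist.new_group(ranks)  # every rank must create every group
        if rank in ranks:
            ulysses_group = group
    # Ranks sharing the same Ulysses index form a ring group, e.g. {0,2}, {1,3}.
    for u in range(ulysses_size):
        ranks = [r * ulysses_size + u for r in range(ring_size)]
        group = dist.new_group(ranks)
        if rank in ranks:
            ring_group = group
    return ulysses_group, ring_group
```

Ulysses then redistributes attention heads via all-to-all inside ulysses_group, while Ring Attention circulates KV blocks around ring_group; the product of the two sizes gives the total sequence-parallel degree (e.g. 2 × 2 = 4 in the test run later in this thread).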
3. Expected Benefits
- Enable 100k+ sequence training without OOM
- Maintain computational efficiency
- Preserve model accuracy at scale
Any progress on this feature?
In progress. Expected to be completed around next Wednesday. I will finish the SDPA version first; the official nightly build of FlexAttention is already available, and I will integrate it later.
Sure, awesome.
Hello, how is your development going? If there's any progress, we'd be happy to collaborate on it.
https://github.com/sgl-project/SpecForge/pull/363 @jiapingW
Hi, the feature is mostly done. There was a drop in performance after the merge, but I've traced it back to an issue with the LR scheduler and will fix it later.
If you're in a hurry, feel free to try the PR above. It's stable for offline use, though the online implementation still lacks the hidden state TP-to-SP mapping.
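For anyone curious, here is a rough, purely illustrative sketch of what that mapping could look like (this is not the PR's code): under TP the hidden states are sharded along the hidden dimension, while SP wants them sharded along the sequence dimension, so the remap is roughly an all-gather over the TP group followed by a sequence split over the SP group.

```python
import torch
import torch.distributed as dist

def tp_to_sp(hidden: torch.Tensor, tp_group, sp_group) -> torch.Tensor:
    """hidden: [batch, seq, hidden/tp] -> [batch, seq/sp, hidden] (sketch only)."""
    # 1) All-gather the hidden dimension inside the TP group.
    tp_size = dist.get_world_size(tp_group)
    shards = [torch.empty_like(hidden) for _ in range(tp_size)]
    dist.all_gather(shards, hidden.contiguous(), group=tp_group)
    full = torch.cat(shards, dim=-1)              # [batch, seq, hidden]
    # 2) Keep only this rank's sequence shard for the SP group.
    sp_size = dist.get_world_size(sp_group)
    sp_rank = dist.get_rank(sp_group)
    return full.chunk(sp_size, dim=1)[sp_rank]    # [batch, seq/sp, hidden]
```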
Great, I'll test its memory optimization effect in the next couple of days.
I tested with the following command on 4 x H20 GPUs.
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
    --nproc_per_node 4 \
    --standalone \
    scripts/train_eagle3.py \
    --target-model-path Qwen/Qwen2.5-7B \
    --draft-model-config $ROOT_DIR/configs/qwen2.5-7b-eagle3.json \
    --train-data-path /disk3/wjp/datasets/1w.jsonl \
    --train-hidden-states-path $HIDDEN_STATE_PATH \
    --output-dir $ROOT_DIR/outputs/1w_usp \
    --num-epochs 4 \
    --tp-size 1 \
    --learning-rate 1e-4 \
    --attention-backend usp \
    --sp-ulysses-size 2 \
    --sp-ring-size 2 \
    --max-length 16384 \
    --chat-template qwen \
    --cache-dir $ROOT_DIR/cache \
    --report tensorboard
```
The result: per-GPU memory usage was over 90 GB, while my single-GPU run using only DP used just 50 GB.
FlexAttention isn't supported yet, so SDPA uses more memory. We'll add FlexAttention support later.
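For context, the backend switch roughly amounts to the following (illustrative sketch only; `attend` and `attention_backend` are made-up names mirroring the --attention-backend flag, not SpecForge's code):

```python
import torch
import torch.nn.functional as F

def attend(q, k, v, attention_backend: str = "sdpa"):
    """q, k, v: [batch, heads, seq, head_dim] (sketch only)."""
    if attention_backend == "sdpa":
        # Fused SDPA path available in stable PyTorch today.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    if attention_backend == "flex":
        # FlexAttention currently requires a recent PyTorch nightly build.
        from torch.nn.attention.flex_attention import (
            create_block_mask,
            flex_attention,
        )

        def causal(b, h, q_idx, kv_idx):
            return q_idx >= kv_idx

        block_mask = create_block_mask(
            causal, B=None, H=None, Q_LEN=q.shape[-2], KV_LEN=k.shape[-2]
        )
        return flex_attention(q, k, v, block_mask=block_mask)
    raise ValueError(f"unknown backend: {attention_backend}")
```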
If you test with DP, you can check memory usage with --attention-backend sdpa.
For 16K length, we've tested and confirmed --sp_ulysses_size 2 --sp_ring_size 4 runs successfully.
Great. With --sp_ulysses_size 2 --sp_ring_size 4, each GPU uses 58 GB of VRAM, which is less than the original SDPA run.