
[rollout] feat: add vllm pipeline parallel support for zmq executor

Open · snie2012 opened this pull request 3 months ago · 21 comments

What does this PR do?

Mostly based on https://github.com/volcengine/verl/pull/2784. Fixes the socket error mentioned in the original PR.

Checklist Before Starting

  • [ ] Search for similar PRs. Paste at least one query link here: ...
  • [ ] Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
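
A minimal sketch of how the new option would be used, assuming the config keys from the test run later in this thread (only the rollout-related flags are shown; the remaining data/actor/trainer options are unchanged):

# Hypothetical excerpt: pipeline parallelism for the vllm async rollout is
# selected next to tensor parallelism; tp_size * pp_size GPUs form one
# inference group.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.mode=async \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.pipeline_model_parallel_size=2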

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR.

snie2012 avatar Aug 01 '25 00:08 snie2012

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ eric-haibin-lin
❌ Shaoliang Nie
❌ snie2012


Shaoliang Nie does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Aug 01 '25 00:08 CLAassistant

Could you help run an end-to-end example and show the accuracy matches?

vermouth1992 avatar Aug 01 '25 02:08 vermouth1992

Could you help run an end-to-end example and show the accuracy matches?

Which example do you recommend running? grpo.sh?

snie2012 avatar Aug 01 '25 17:08 snie2012

@vermouth1992 @eric-haibin-lin I ran a comparison of pp=1 and pp=2 for grpo.sh on 2 H100 nodes. The rewards are comparable (see attached reward-curve screenshot, 2025-08-01 12:57).

But in my run pp=2 is slower than pp=1 (see attached throughput screenshot, 2025-08-01 13:03; the red line is pp=1).

The exact config I have is:

MODEL_PATH=models--Qwen--Qwen2.5-7B-Instruct
# For async rollout mode, dataset should return raw chat.
rollout_mode="async"
return_raw_chat="False"
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_PATH/gsm8k/train.jsonl \
    data.val_files=$DATA_PATH/gsm8k/test.jsonl \
    data.train_batch_size=512 \
    data.max_prompt_length=4096 \
    data.max_response_length=4096 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.return_raw_chat=$return_raw_chat \
    actor_rollout_ref.model.path="$MODEL_PATH" \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=512 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.pipeline_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.mode=$rollout_mode \
    actor_rollout_ref.rollout.calculate_log_probs=True \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.strategy=fsdp2 \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name='verl_grpo_example_gsm8k_record' \
    trainer.experiment_name='qwen2_7b_function_rm_re' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_freq=5 \
    trainer.test_freq=5 \
    trainer.total_epochs=3

snie2012 avatar Aug 01 '25 20:08 snie2012

@snie2012 that's expected. I'd recommend that users who want to run large models but don't have RDMA/IB across nodes switch from TP16 to TP8PP2.

eric-haibin-lin avatar Aug 02 '25 17:08 eric-haibin-lin
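
For illustration, a sketch of the TP8PP2 layout being suggested, reusing the config keys from the run above (2 nodes x 8 GPUs assumed; only the relevant flags shown, not verified here):

# Hypothetical TP8/PP2 variant of the run above: TP stays inside a node and PP
# spans the node boundary, so only activations cross the slower inter-node link.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
    actor_rollout_ref.rollout.pipeline_model_parallel_size=2 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2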

@snie2012 that's expected. I'd recommend that users who want to run large models but don't have RDMA/IB across nodes switch from TP16 to TP8PP2.

Sounds good. Let me know if you have any suggestions for landing the current PR, thanks.

snie2012 avatar Aug 02 '25 21:08 snie2012

Please fix the pre-commit error. I guess we have to wait for https://github.com/vllm-project/vllm/pull/21786.

eric-haibin-lin avatar Aug 02 '25 23:08 eric-haibin-lin

Please fix the pre-commit error. I guess we have to wait for vllm-project/vllm#21786.

thanks, updated

snie2012 avatar Aug 04 '25 16:08 snie2012

@eric-haibin-lin I'd appreciate it if you could take a look and see whether we can close this out, thanks.

snie2012 avatar Aug 05 '25 21:08 snie2012

I hit this error when running with:

  1. vllm 0.9.1
  2. rollout_mode="async"
  3. VLLM_USE_V1=1

VLLM_USE_V1=0 does not work either.

NotImplementedError: VLLM_USE_V1=1 is not supported with Pipeline Parallelism without Ray distributed executor or multiprocessing executor or external launcher.

What is the baseline setup for using pp?

chenhaiq avatar Aug 07 '25 06:08 chenhaiq

I hit this error when running with:

  1. vllm 0.9.1
  2. rollout_mode="async"
  3. VLLM_USE_V1=1

VLLM_USE_V1=0 does not work either.

NotImplementedError: VLLM_USE_V1=1 is not supported with Pipeline Parallelism without Ray distributed executor or multiprocessing executor or external launcher.

What is the baseline setup for using pp?

Maybe you missed the patch mentioned here: https://github.com/volcengine/verl/pull/2784 (see the screenshot of the patch attached).

Yangruipis avatar Aug 07 '25 08:08 Yangruipis

I hit this error when running with:

  1. vllm 0.9.1
  2. rollout_mode="async"
  3. VLLM_USE_V1=1

VLLM_USE_V1=0 does not work either. NotImplementedError: VLLM_USE_V1=1 is not supported with Pipeline Parallelism without Ray distributed executor or multiprocessing executor or external launcher. What is the baseline setup for using pp?

Maybe you missed the patch mentioned here: #2784 (see the screenshot of the patch attached).

Yes, you need to apply the vllm patch or make that change locally to test.

snie2012 avatar Aug 07 '25 17:08 snie2012

rebase and fix format

snie2012 avatar Aug 07 '25 17:08 snie2012

rebase and fix format

tests should pass this time cc @eric-haibin-lin

snie2012 avatar Aug 10 '25 18:08 snie2012

rebase and fix format

tests should pass this time cc @eric-haibin-lin

Please fix the failed CI and rebase onto the main branch. Do we still need to patch vllm in 0.10.0?

chenhaiq avatar Aug 13 '25 03:08 chenhaiq

rebase and fix format

tests should pass this time cc @eric-haibin-lin

Please fix the failed CI and rebase onto the main branch. Do we still need to patch vllm in 0.10.0?

Looks like the patch is not included in vllm 0.10.0, so we still need to patch locally until the next release.

Meanwhile, I found the same code hangs after rebasing to latest main. Will have to look into it.

snie2012 avatar Aug 13 '25 16:08 snie2012

I hit this error when running with:

  1. vllm 0.9.2
  2. rollout_mode="async"
  3. VLLM_USE_V1=1
  4. return_raw_chat=True
  5. pp>1

At step 3 or step 4: [rank1]:[E903 10:49:42.073383441 ProcessGroupNCCL.cpp:1896] [PG ID 5 PG GUID 15 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered

Setting CUDA_LAUNCH_BLOCKING=1 works around this bug, but it does not seem like a very good solution. Please take a look, @snie2012.

Tecorigin avatar Sep 03 '25 04:09 Tecorigin
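
For anyone reproducing this, the workaround above amounts to exporting the variable before launch. Note that CUDA_LAUNCH_BLOCKING=1 is a debugging setting that makes kernel launches synchronous, so it costs throughput and likely only masks the underlying assert rather than fixing it:

# Workaround mentioned above (debug setting; expect slower rollout and treat it
# as masking, not fixing, the device-side assert). grpo.sh is the launch script
# referenced earlier in this thread.
export CUDA_LAUNCH_BLOCKING=1
bash grpo.sh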

I hit this error when running with:

  1. vllm 0.9.2
  2. rollout_mode="sync"
  3. pp>1

In sync mode the error remains unchanged: with the feature enabled it is the same as when it was disabled. Please take a look, @snie2012.

lulu51230 avatar Sep 03 '25 05:09 lulu51230

Hi, as mentioned, the same code stops working after rebasing onto the latest main. I haven't had time to figure it out yet.

snie2012 avatar Sep 08 '25 16:09 snie2012

@snie2012 Hi, would you mind if I open a new PR? I've resolved the issues on the master branch with vllm 0.10.0, and I've done accuracy verification and some profiling.

SanftMonster avatar Sep 26 '25 10:09 SanftMonster

@snie2012 Hi, would you mind if I open a new PR? I've resolved the issues on the master branch with vllm 0.10.0, and I've done accuracy verification and some profiling.

Sure, please go ahead.

snie2012 avatar Sep 26 '25 15:09 snie2012