
[rollout] feat: add vllm pipeline parallel support for zmq executor

Open · snie2012 opened this pull request 3 months ago · 21 comments

What does this PR do?

Mostly based on https://github.com/volcengine/verl/pull/2784. Fixes the socket error mentioned in the original PR.

Checklist Before Starting

  • [ ] Search for similar PRs. Paste at least one query link here: ...
  • [ ] Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
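
A minimal sketch of how the new option would be used, assuming the config keys from the test run later in this thread (only the rollout-related flags are shown; the remaining data/actor/trainer options are unchanged):

# Hypothetical excerpt: pipeline parallelism for the vllm async rollout is
# selected next to tensor parallelism; tp_size * pp_size GPUs form one
# inference group.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.mode=async \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.pipeline_model_parallel_size=2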

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR.

snie2012 avatar Aug 01 '25 00:08 snie2012

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ eric-haibin-lin
❌ Shaoliang Nie
❌ snie2012


Shaoliang Nie does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Aug 01 '25 00:08 CLAassistant

Could you help run an end-to-end example and show the accuracy matches?

vermouth1992 avatar Aug 01 '25 02:08 vermouth1992

Could you help run an end-to-end example and show the accuracy matches?

Which example do you recommend running? grpo.sh?

snie2012 avatar Aug 01 '25 17:08 snie2012

@vermouth1992 @eric-haibin-lin I ran a comparison of pp=1 and pp=2 for grpo.sh on 2 H100 nodes. The rewards are comparable (see attached reward-curve screenshot, 2025-08-01 12:57).

But in my run pp=2 is slower than pp=1 (see attached throughput screenshot, 2025-08-01 13:03; the red line is pp=1).

The exact config I have is:

MODEL_PATH=models--Qwen--Qwen2.5-7B-Instruct
# For async rollout mode, dataset should return raw chat.
rollout_mode="async"
return_raw_chat="False"
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_PATH/gsm8k/train.jsonl \
    data.val_files=$DATA_PATH/gsm8k/test.jsonl \
    data.train_batch_size=512 \
    data.max_prompt_length=4096 \
    data.max_response_length=4096 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.return_raw_chat=$return_raw_chat \
    actor_rollout_ref.model.path="$MODEL_PATH" \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=512 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.pipeline_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.mode=$rollout_mode \
    actor_rollout_ref.rollout.calculate_log_probs=True \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.strategy=fsdp2 \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name='verl_grpo_example_gsm8k_record' \
    trainer.experiment_name='qwen2_7b_function_rm_re' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_freq=5 \
    trainer.test_freq=5 \
    trainer.total_epochs=3

snie2012 avatar Aug 01 '25 20:08 snie2012

@snie2012 that's expected. I'd recommend that users who want to run large models but don't have RDMA/IB across nodes switch from TP16 to TP8PP2.

eric-haibin-lin avatar Aug 02 '25 17:08 eric-haibin-lin
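
For illustration, a sketch of the TP8PP2 layout being suggested, reusing the config keys from the run above (2 nodes x 8 GPUs assumed; only the relevant flags shown, not verified here):

# Hypothetical TP8/PP2 variant of the run above: TP stays inside a node and PP
# spans the node boundary, so only activations cross the slower inter-node link.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
    actor_rollout_ref.rollout.pipeline_model_parallel_size=2 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2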

@snie2012 that's expected. I'd recommend that users who want to run large models but don't have RDMA/IB across nodes switch from TP16 to TP8PP2.

Sounds good. Let me know if you have any suggestions for landing the current PR, thanks.

snie2012 avatar Aug 02 '25 21:08 snie2012

Please fix the pre-commit error. I guess we have to wait for https://github.com/vllm-project/vllm/pull/21786.

eric-haibin-lin avatar Aug 02 '25 23:08 eric-haibin-lin

Please fix the pre-commit error. I guess we have to wait for vllm-project/vllm#21786.

thanks, updated

snie2012 avatar Aug 04 '25 16:08 snie2012

@eric-haibin-lin I'd appreciate it if you could take a look and see whether we can close this out, thanks.

snie2012 avatar Aug 05 '25 21:08 snie2012

I hit this error when running with:

  1. vllm 0.9.1
  2. rollout_mode="async"
  3. VLLM_USE_V1=1

VLLM_USE_V1=0 does not work either.

NotImplementedError: VLLM_USE_V1=1 is not supported with Pipeline Parallelism without Ray distributed executor or multiprocessing executor or external launcher.

What is the baseline setup for using pp?

chenhaiq avatar Aug 07 '25 06:08 chenhaiq

I hit this error when running with:

  1. vllm 0.9.1
  2. rollout_mode="async"
  3. VLLM_USE_V1=1

VLLM_USE_V1=0 does not work either.

NotImplementedError: VLLM_USE_V1=1 is not supported with Pipeline Parallelism without Ray distributed executor or multiprocessing executor or external launcher.

What is the baseline setup for using pp?

Maybe you missed the patch mentioned here: https://github.com/volcengine/verl/pull/2784 (see the screenshot of the patch attached).

Yangruipis avatar Aug 07 '25 08:08 Yangruipis

I hit this error when running with:

  1. vllm 0.9.1
  2. rollout_mode="async"
  3. VLLM_USE_V1=1

VLLM_USE_V1=0 does not work either. NotImplementedError: VLLM_USE_V1=1 is not supported with Pipeline Parallelism without Ray distributed executor or multiprocessing executor or external launcher. What is the baseline setup for using pp?

Maybe you missed the patch mentioned here: #2784 (see the screenshot of the patch attached).

Yes, you need to apply the vllm patch or make that change locally to test.

snie2012 avatar Aug 07 '25 17:08 snie2012

rebase and fix format

snie2012 avatar Aug 07 '25 17:08 snie2012

rebase and fix format

tests should pass this time cc @eric-haibin-lin

snie2012 avatar Aug 10 '25 18:08 snie2012

rebase and fix format

tests should pass this time cc @eric-haibin-lin

Please fix the failed CI and rebase onto the main branch. Do we still need to patch vllm in 0.10.0?

chenhaiq avatar Aug 13 '25 03:08 chenhaiq

rebase and fix format

tests should pass this time cc @eric-haibin-lin

Please fix the failed CI and rebase onto the main branch. Do we still need to patch vllm in 0.10.0?

Looks like the patch is not included in vllm 0.10.0, so we still need to patch locally until the next release.

Meanwhile, I found the same code hangs after rebasing to latest main. Will have to look into it.

snie2012 avatar Aug 13 '25 16:08 snie2012

I hit this error when running with:

  1. vllm 0.9.2
  2. rollout_mode="async"
  3. VLLM_USE_V1=1
  4. return_raw_chat=True
  5. pp>1

At step 3 or step 4: [rank1]:[E903 10:49:42.073383441 ProcessGroupNCCL.cpp:1896] [PG ID 5 PG GUID 15 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered

Setting CUDA_LAUNCH_BLOCKING=1 works around this bug, but it does not seem like a very good solution. Please take a look, @snie2012.

Tecorigin avatar Sep 03 '25 04:09 Tecorigin
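
For anyone reproducing this, the workaround above amounts to exporting the variable before launch. Note that CUDA_LAUNCH_BLOCKING=1 is a debugging setting that makes kernel launches synchronous, so it costs throughput and likely only masks the underlying assert rather than fixing it:

# Workaround mentioned above (debug setting; expect slower rollout and treat it
# as masking, not fixing, the device-side assert). grpo.sh is the launch script
# referenced earlier in this thread.
export CUDA_LAUNCH_BLOCKING=1
bash grpo.sh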

I hit this error when running with:

  1. vllm 0.9.2
  2. rollout_mode="sync"
  3. pp>1

In sync mode the error remains unchanged: with the feature enabled it is the same as when it was disabled. Please take a look, @snie2012.

lulu51230 avatar Sep 03 '25 05:09 lulu51230

Hi, as mentioned, the same code stops working after rebasing onto the latest main. I haven't had time to figure it out yet.

snie2012 avatar Sep 08 '25 16:09 snie2012

@snie2012 Hi, would you mind if I open a new PR? I've resolved the issues on the master branch with vllm 0.10.0, and I've done accuracy verification and some profiling.

SanftMonster avatar Sep 26 '25 10:09 SanftMonster

@snie2012 Hi, would you mind if I open a new PR? I've resolved the issues on the master branch with vllm 0.10.0, and I've done accuracy verification and some profiling.

Sure, please go ahead.

snie2012 avatar Sep 26 '25 15:09 snie2012