[rollout] feat: add vllm pipeline parallel support for zmq executor
What does this PR do?
Mostly based on https://github.com/volcengine/verl/pull/2784. Fixes the socket error mentioned in the original PR.
Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
Test
For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use this
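A minimal usage sketch (not taken verbatim from this PR; the flag names match the config used later in this thread, and all other arguments are elided): pipeline parallelism for the vLLM rollout is enabled alongside tensor parallelism via the rollout config.

```bash
# Hedged sketch: enable pipeline parallelism for the vLLM rollout (async mode).
# Flag names mirror the grpo-style config shown later in this thread.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.mode=async \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.pipeline_model_parallel_size=2 \
    ...  # remaining data/model/trainer arguments as usual
```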
Design & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the Contribute Guide.
- [ ] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update the documentation.
- [ ] Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.
:white_check_mark: eric-haibin-lin
:x: Shaoliang Nie
:x: snie2012
Shaoliang Nie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
Could you help run an end-to-end example and show the accuracy matches?
Which example do you recommend running? grpo.sh?
@vermouth1992 @eric-haibin-lin Had a run comparing pp=1 and pp=2 for grpo.sh with 2 H100 nodes. The rewards are comparable (reward curve plots omitted).
But in my run pp=2 is slower than pp=1 (plot omitted; the red line is pp=1).
The exact config I have is:
MODEL_PATH=models--Qwen--Qwen2.5-7B-Instruct
# For async rollout mode, dataset should return raw chat.
rollout_mode="async"
return_raw_chat="False"
if [ "$rollout_mode" = "async" ]; then
export VLLM_USE_V1=1
return_raw_chat="True"
fi
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_PATH/gsm8k/train.jsonl \
data.val_files=$DATA_PATH/gsm8k/test.jsonl \
data.train_batch_size=512 \
data.max_prompt_length=4096 \
data.max_response_length=4096 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.return_raw_chat=$return_raw_chat \
actor_rollout_ref.model.path="$MODEL_PATH" \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=512 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.strategy=fsdp2 \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.pipeline_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.rollout.mode=$rollout_mode \
actor_rollout_ref.rollout.calculate_log_probs=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.ref.strategy=fsdp2 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','tensorboard'] \
trainer.project_name='verl_grpo_example_gsm8k_record' \
trainer.experiment_name='qwen2_7b_function_rm_re' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.save_freq=5 \
trainer.test_freq=5 \
trainer.total_epochs=3
@snie2012 that's expected. I'd recommend switching from TP16 to TP8PP2 for users who want to run large models but don't have RDMA/IB across nodes.
Sounds good. Let me know if there is anything else needed for the current PR to land, thanks.
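For context, a sketch of what that recommendation could look like in the rollout config (the flag names are the same ones used in the script above; taking 2 nodes x 8 GPUs as the target layout is my assumption):

```bash
# Hedged sketch: for a large model on 2 nodes x 8 GPUs without cross-node RDMA/IB,
# keep TP inside a node and use PP across nodes instead of stretching TP over 16 GPUs.
# Each vLLM engine then spans tensor_parallel * pipeline_parallel = 8 * 2 = 16 GPUs.
actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
actor_rollout_ref.rollout.pipeline_model_parallel_size=2 \
```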
Please fix the pre-commit error. I guess we have to wait for https://github.com/vllm-project/vllm/pull/21786
@eric-haibin-lin I'd appreciate it if you could take a look and see whether we can close this, thanks.
I hit this error when running with:
- vllm 0.9.1
- rollout_mode="async"
- VLLM_USE_V1=1

VLLM_USE_V1=0 does not work either.

NotImplementedError: VLLM_USE_V1=1 is not supported with Pipeline Parallelism without Ray distributed executor or multiprocessing executor or external launcher.

What is the baseline setup for using pp?
Maybe you missed the patch mentioned here: https://github.com/volcengine/verl/pull/2784
Yes, you need to patch the vllm change or change it locally to test.
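In case it helps others hitting the same NotImplementedError: one possible way to apply the unreleased vllm fix locally is to build vllm from source with the PR applied (a sketch only; the local branch name `pp-fix` is made up, and a clean apply onto your release is not guaranteed):

```bash
# Hedged sketch: build vllm from source with the executor fix from vllm PR #21786.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/21786/head:pp-fix   # fetch the PR head as a local branch
git checkout pp-fix                       # or cherry-pick its commits onto the release tag you use
pip install -e .                          # editable install; compiles vllm from source
```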
rebase and fix format
tests should pass this time cc @eric-haibin-lin
Please fix the failed CI and rebase with the main branch. Do we need to patch vllm in 0.10.0?
Looks like the patch is not included in vllm 0.10.0, so we still need to patch locally until the next release.
Meanwhile, I found that the same code hangs after rebasing to the latest main. Will have to look into it.
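Not related to the PR content itself, but in case it helps locate the hang: dumping the Python stacks of the stuck processes is one generic way to see where things block (py-spy and the process-name pattern below are my suggestion, not something used in this PR):

```bash
# Hedged sketch: inspect where a hung trainer/rollout process is blocked.
pip install py-spy
pgrep -af "main_ppo"              # find candidate worker PIDs (pattern is illustrative)
py-spy dump --pid <PID> --native  # print Python (and native) stack frames for one PID
```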
I hit this error when running with:
- vllm 0.9.2
- rollout_mode="async"
- VLLM_USE_V1=1
- return_raw_chat=True
- pp > 1

At step 3 or step 4: [rank1]:[E903 10:49:42.073383441 ProcessGroupNCCL.cpp:1896] [PG ID 5 PG GUID 15 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Setting CUDA_LAUNCH_BLOCKING=1 works around this bug, but it does not seem like a good solution. Please take a look @snie2012.
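For reference, the workaround above is a single environment variable; it forces synchronous kernel launches, which makes device-side asserts easier to attribute but noticeably slows execution, so it is more of a debugging aid than a fix:

```bash
# Workaround mentioned above: serialize CUDA kernel launches.
export CUDA_LAUNCH_BLOCKING=1
```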
I hit this error when running with:
- vllm 0.9.2
- rollout_mode="sync"
- pp > 1

In sync mode the error remains unchanged: even with this feature enabled, it behaves the same as when it was disabled. Please take a look @snie2012.
Hi, as mentioned, the same code stops working after rebasing to the latest main. I haven't had time to figure it out yet.
@snie2012 Hi, would you mind if I open a new PR? I've resolved the issues on the master branch with vllm 0.10.0, and I've done the accuracy verification and some profiling of it.
Sure, please go ahead.
