vLLM generation speed difference under sync and async mode
We tested model training performance in async and sync modes and found that rollout performance in async mode is significantly lower than in sync mode, especially with long response lengths. Our verl version is 0.6.1 and our vLLM version is 0.11.0.
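As a baseline outside verl, raw generation throughput can be checked with vLLM's offline `LLM` API, roughly as follows (a minimal sketch: the model path and sampling settings mirror the run script below, and the prompt is a placeholder):

```python
# Minimal sync-generation throughput probe (a sketch, not our training code).
# Settings mirror the run script: 2 prompts x n=16, up to 8192 new tokens.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="DeepSeek-R1-Distill-Qwen-7B", gpu_memory_utilization=0.8)
sampling = SamplingParams(temperature=0.9, n=16, max_tokens=8192)

prompts = ["Prove that the sum of two even integers is even."] * 2  # placeholder

start = time.time()
outputs = llm.generate(prompts, sampling)
elapsed = time.time() - start

# Count generated tokens across all prompts and all n samples per prompt.
tokens = sum(len(c.token_ids) for o in outputs for c in o.outputs)
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```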
Below are our run script and the wandb results.
```bash
rollout_mode="sync"
rollout_name="vllm" # sglang or vllm
use_dynamic_bsz=True # assumed; referenced below but not set in the original snippet
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi
python3 -m verl.trainer.main_ppo \
--config-path=config \
--config-name='ppo_trainer.yaml' \
data.train_files=deepscaler/train.parquet \
data.val_files=test.parquet \
algorithm.adv_estimator=grpo \
data.train_batch_size=2 \
data.val_batch_size=512 \
data.return_raw_chat=${return_raw_chat:-True} \
data.max_prompt_length=1024 \
data.max_response_length=8192 \
actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
actor_rollout_ref.model.path=DeepSeek-R1-Distill-Qwen-7B \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.model.use_liger=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size=64 \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=10240 \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=10240 \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=10240 \
reward_model.launch_reward_fn_async=True \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.mode=${rollout_mode} \
actor_rollout_ref.rollout.temperature=0.9 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.rollout.n=16 \
actor_rollout_ref.rollout.val_kwargs.n=8 \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.rollout.val_kwargs.top_p=0.95 \
actor_rollout_ref.rollout.val_kwargs.temperature=0.6 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=${WORLD_SIZE} \
trainer.save_freq=-1 \
trainer.test_freq=-1 \
trainer.default_hdfs_dir=null \
trainer.total_epochs=5 \
trainer.default_local_dir=${OUTPUT_DIR} \
data.filter_overlong_prompts=True \
reward_model.reward_manager=naive \
actor_rollout_ref.actor.strategy=fsdp2 \
critic.strategy=fsdp2 "${@:1}"
```
To rule out any effect from how requests are dispatched on either side, we set a very small batch size (train_batch_size=2), but the gap remains. We want to know whether async mode itself degrades vLLM's generation performance.
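For comparison, the async-engine path can be timed in isolation with something like the sketch below (assuming vLLM's `AsyncLLMEngine`/`AsyncEngineArgs` entry points; each sample is submitted as its own request to mimic server-style dispatch, so 2 prompts x n=16 becomes 32 requests):

```python
# Async-engine counterpart of the probe above (a sketch; adjust imports to your build).
import asyncio
import time
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def collect(engine, prompt, sampling):
    # generate() is an async generator of partial results; keep only the final one.
    final = None
    async for out in engine.generate(prompt, sampling, request_id=str(uuid.uuid4())):
        final = out
    return final


async def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="DeepSeek-R1-Distill-Qwen-7B", gpu_memory_utilization=0.8)
    )
    sampling = SamplingParams(temperature=0.9, max_tokens=8192)
    prompts = ["Prove that the sum of two even integers is even."] * 32  # 2 x n=16

    start = time.time()
    results = await asyncio.gather(*(collect(engine, p, sampling) for p in prompts))
    elapsed = time.time() - start

    tokens = sum(len(r.outputs[0].token_ids) for r in results)
    print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")


asyncio.run(main())
```

If this standalone async probe matches the sync numbers, the slowdown is more likely in verl's async rollout dispatch than in vLLM itself.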