Does verl Support Running PPO or GRPO Algorithms on the Qwen2.5 72B Model?
Hello,
I am interested in running reinforcement learning algorithms, specifically PPO (Proximal Policy Optimization) or GRPO (Group Relative Policy Optimization), on the Qwen2.5 72B model. I have a few questions regarding the setup and requirements:
- Compatibility: Does your framework currently support running PPO or GRPO with the Qwen2.5 72B model?
- Configuration Requirements: If supported, what are the recommended configurations for running these algorithms effectively?
- GPU Resources: What are the GPU requirements for running these algorithms with the Qwen2.5 72B model?
- Distributed Training: Would you recommend FSDP (Fully Sharded Data Parallel) or Megatron for distributed training?
- Parallelism Settings: How should TP (Tensor Parallelism), PP (Pipeline Parallelism), and SP (Sequence Parallelism) be set up for optimal performance? (A rough sketch of the setup I am picturing follows this list.)
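For concreteness, here is a rough sketch of the launch I am picturing, based on the standard verl main_ppo entry point; the model path, node count, and parallelism values below are placeholders rather than a validated 72B recipe:
model="Qwen/Qwen2.5-72B-Instruct"   # placeholder; substitute a local checkpoint path
python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
actor_rollout_ref.model.path=$model \
critic.model.path=$model \
actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.logger=['console']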
Any guidance or resources you could provide would be greatly appreciated.
Thank you!
+1
I'm excited to share that GRPO has been successfully validated on a 2-node H20 cluster (8 GPUs per node, 16 GPUs total, 96 GB memory per GPU) using verl. A detailed implementation report with performance benchmarks and lessons learned will be shared soon!
There are some reference performance numbers from the community: https://www.volcengine.com/docs/6459/1463942 (although they do not include results for the Qwen 72B model).
Hello, any updates?
Hi, I tried to train 72B on 2*8 A800 (80 GB) with every batch_size_per_gpu set to 1, but I still hit OOM. It seems to happen in the vLLM stage even though the TP size is 8 (should I increase gpu_memory_utilization above 0.4? I'm also wondering why only a rather small fraction of GPU memory can be given to vLLM here). Any suggestions? Script:
PROJ_HOME='/mnt/data/RL/verl/'
cd $PROJ_HOME
YOUR_PROJECT_NAME="verl_test"
YOUR_RUN_NAME="72B_gsm8k_PPO"
# model="/mnt/data/qwen/model/Qwen2.5-3B-Instruct"
model="/mnt/data/qwen/model/Qwen2.5-72B-Instruct"
# model="/mnt/data/end_side/model/Qwen2.5-0.5B-Instruct"
actor_model_path=$model
critic_model_path=$model
# export NPROC_PER_NODE=1
export NPROC_PER_NODE=8       # GPUs per node
export ROLLOUT_TP_SIZE=8      # vLLM tensor-parallel size for the rollout engine
export NNODES=2
export HYDRA_FULL_ERROR=1     # exported so Hydra prints full stack traces
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=$PROJ_HOME/data/gsm8k/train.parquet \
data.val_files=$PROJ_HOME/data/gsm8k/test.parquet \
data.train_batch_size=64 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=$actor_model_path \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
critic.optim.lr=1e-5 \
critic.model.path=$critic_model_path \
critic.ppo_micro_batch_size_per_gpu=1 \
algorithm.kl_ctrl.kl_coef=0.001 \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$NPROC_PER_NODE \
trainer.nnodes=${NNODES:-1} \
trainer.save_freq=1000 \
trainer.test_freq=1 \
trainer.total_epochs=15 \
trainer.logger=['console','wandb'] \
trainer.project_name=${YOUR_PROJECT_NAME} \
trainer.experiment_name=$YOUR_RUN_NAME 2>&1 | tee verl_demo_$(date +%m%d-%H%M).log
log:
self.model_executor.sync_model_weights(actor_weights=actor_weights, load_format=load_format)
  File "/mnt/data/RL/verl/verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py", line 213, in sync_model_weights
    self.worker.sync_model_weights(actor_weights=actor_weights, load_format=load_format)
  File "/mnt/data/RL/verl/verl/third_party/vllm/vllm_v_0_6_3/worker.py", line 281, in sync_model_weights
    load_dtensor_weights(actor_weights, self.model_runner.model)
  File "/mnt/data/RL/verl/verl/third_party/vllm/vllm_v_0_6_3/dtensor_weight_loaders.py", line 368, in load_dtensor_weights
    vllm_model = vllm_model.cuda()
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 916, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 916, in <lambda>
    return self._apply(lambda t: t.cuda(device))
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 232.00 MiB. GPU 0 has a total capacity of 79.33 GiB of which 195.81 MiB is free. Process 3906 has 79.12 GiB memory in use. Of the allocated memory 76.78 GiB is allocated by PyTorch, and 82.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
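Next I plan to try the allocator hint from the error message together with the usual FSDP offload switches; a rough sketch is below (the fsdp_config offload keys follow verl's FSDP worker config and the values are illustrative, so please double-check the exact option names against your verl version):
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # suggested by the OOM message to reduce fragmentation
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
... \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.model.fsdp_config.param_offload=True \
critic.model.fsdp_config.optimizer_offload=True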
If you want to use vLLM and 72B_model for sampling, a single card will result in an OOM (out of memory) error, but two cards will work.
Hi, thank you for the clue, but I also tried giving vLLM 16 cards (and reduced the max lengths). Is there an example for 72B anywhere?
export ROLLOUT_TP_SIZE=16
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
...
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
Did you encounter OOM, or other errors? Cross-node TP rollout should be functional. Here's an example: https://github.com/volcengine/verl/commit/7646e08fca74183baa2790690456a2aa7568fb55
We provide example scripts for models of different sizes here: https://verl.readthedocs.io/en/latest/perf/device_tuning.html
We are looking for contributors with more data points. Please submit a PR if you can provide data points for new model sizes.
Thanks for your reply! But the link to the example scripts seems to be broken (404).
@none0663 Any update on that report? I am facing the same problem!