
Does verl Support Running PPO or GRPO Algorithms on the Qwen2.5 72B Model?

Open none0663 opened this issue 9 months ago • 9 comments

Hello,

I am interested in running reinforcement learning algorithms, specifically PPO (Proximal Policy Optimization) or GRPO (Group Relative Policy Optimization), on the Qwen2.5 72B model. I have a few questions regarding the setup and requirements:

  1. Compatibility: Does your framework currently support the integration of PPO or GRPO algorithms with the Qwen2.5 72B model?
  2. Configuration Requirements: If supported, what are the recommended configurations for running these algorithms effectively?
  3. GPU Resources: What are the GPU requirements for running these algorithms with the Qwen2.5 72B model?
  4. Distributed Training: Would you recommend using FSDP (Fully Sharded Data Parallel) or Megatron for distributed training?
  5. Parallelism Settings: How should TP (Tensor Parallelism), PP (Pipeline Parallelism), and SP (Sequence Parallelism) be set up for optimal performance? (See the sketch right after this list.)
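
To make questions 4 and 5 concrete, these are roughly the knobs I have in mind (just a sketch of assumed config paths, not something I have verified against the current config):

# FSDP backend, with Ulysses sequence parallelism covering the "SP" part (assumed keys):
#   actor_rollout_ref.actor.strategy=fsdp
#   actor_rollout_ref.actor.ulysses_sequence_parallel_size=...
# Megatron backend, with TP/PP for the training engine (assumed keys):
#   actor_rollout_ref.actor.strategy=megatron
#   actor_rollout_ref.actor.megatron.tensor_model_parallel_size=...
#   actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=...
# Tensor parallelism for the vLLM rollout engine:
#   actor_rollout_ref.rollout.tensor_model_parallel_size=...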

Any guidance or resources you could provide would be greatly appreciated.

Thank you!

none0663 avatar Feb 11 '25 14:02 none0663

+1

puppet101 avatar Feb 12 '25 02:02 puppet101

+1

JiaxingSong718 avatar Feb 14 '25 06:02 JiaxingSong718

+1

qibao77 avatar Feb 14 '25 11:02 qibao77

+1

Eveosev avatar Feb 17 '25 12:02 Eveosev

+1

zhentingqi avatar Feb 17 '25 21:02 zhentingqi

+1

lihaoling avatar Feb 18 '25 09:02 lihaoling

+1

echo-valor avatar Feb 18 '25 11:02 echo-valor

I'm excited to share that GRPO has been successfully validated on a 2-node H20 cluster (8 GPUs per node, 16 GPUs total, 96 GB memory per GPU) using verl. A detailed implementation report with performance benchmarks and lessons learned will be shared soon!

none0663 avatar Feb 23 '25 14:02 none0663

There are some reference performance numbers from the community: https://www.volcengine.com/docs/6459/1463942 (although they lack results for the Qwen 72B model)

eric-haibin-lin avatar Feb 23 '25 23:02 eric-haibin-lin

Hello, any updates?

wkzcml-1 avatar Mar 13 '25 09:03 wkzcml-1

Hi, I tried to train the 72B model on 2×8 A800 (80 GB) GPUs with every batch_size_per_gpu set to 1, but still encountered OOM. It seems to happen in the vLLM stage, even though TP_size is 8. (Maybe increase gpu_memory_utilization from 0.4? I'm also wondering why only a rather small part of the GPU can be used here.) Any suggestions? Script:

PROJ_HOME='/mnt/data/RL/verl/'
cd $PROJ_HOME
YOUR_PROJECT_NAME="verl_test"
YOUR_RUN_NAME="72B_gsm8k_PPO"
# model="/mnt/data/qwen/model/Qwen2.5-3B-Instruct"
model="/mnt/data/qwen/model/Qwen2.5-72B-Instruct"

# model="/mnt/data/end_side/model/Qwen2.5-0.5B-Instruct"
actor_model_path=$model
critic_model_path=$model

# export NPROC_PER_NODE=1
export NPROC_PER_NODE=8
export ROLLOUT_TP_SIZE=8
export NNODES=2

HYDRA_FULL_ERROR=1 PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$PROJ_HOME/data/gsm8k/train.parquet \
    data.val_files=$PROJ_HOME/data/gsm8k/test.parquet \
    data.train_batch_size=64 \
    data.max_prompt_length=512 \
    data.max_response_length=256 \
    actor_rollout_ref.model.path=$actor_model_path \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    critic.optim.lr=1e-5 \
    critic.model.path=$critic_model_path \
    critic.ppo_micro_batch_size_per_gpu=1 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    +trainer.val_before_train=False \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=$NPROC_PER_NODE \
    trainer.nnodes=${NNODES:-1} \
    trainer.save_freq=1000 \
    trainer.test_freq=1 \
    trainer.total_epochs=15 \
    trainer.logger=['console','wandb'] \
    trainer.project_name=${YOUR_PROJECT_NAME} \
    trainer.experiment_name=$YOUR_RUN_NAME 2>&1 | tee verl_demo_$(date +%m%d-%H%M).log

log:

[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     self.model_executor.sync_model_weights(actor_weights=actor_weights, load_format=load_format)
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/mnt/data/RL/verl/verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py", line 213, in sync_model_weights
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     self.worker.sync_model_weights(actor_weights=actor_weights, load_format=load_format)
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/mnt/data/RL/verl/verl/third_party/vllm/vllm_v_0_6_3/worker.py", line 281, in sync_model_weights
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     load_dtensor_weights(actor_weights, self.model_runner.model)
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/mnt/data/RL/verl/verl/third_party/vllm/vllm_v_0_6_3/dtensor_weight_loaders.py", line 368, in load_dtensor_weights
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     vllm_model = vllm_model.cuda()
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 916, in cuda
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     return self._apply(lambda t: t.cuda(device))
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     module._apply(fn)
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     module._apply(fn)
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     module._apply(fn)
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   [Previous line repeated 2 more times]
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     param_applied = fn(param)
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 916, in <lambda>
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w]     return self._apply(lambda t: t.cuda(device))
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 232.00 MiB. GPU 0 has a total capacity of 79.33 GiB of which 195.81 MiB is free. Process 3906 has 79.12 GiB memory in use. Of the allocated memory 76.78 GiB is allocated by PyTorch, and 82.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[stdout] [2025-03-25 15:14:28] [dlc14a09hk89ktle-submitter-bd29w] 
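
For reference, verl's FSDP example scripts also expose offload and gradient-checkpointing switches for large models; a sketch of the kind of extra overrides to add to the command above (the exact key names here are my assumption, please double-check them against the ppo_trainer config of your verl version):

    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    critic.model.enable_gradient_checkpointing=True \
    critic.model.fsdp_config.param_offload=True \
    critic.model.fsdp_config.optimizer_offload=True \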

jijivski avatar Mar 25 '25 09:03 jijivski

If you want to use vLLM with a 72B model for sampling, a single card will result in an OOM (out of memory) error, but two cards will work.
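
Rough arithmetic, assuming bf16 weights: 72B parameters × 2 bytes ≈ 144 GB, which already exceeds a single 80 GB card before any KV cache is allocated; with TP=2 the weights split to roughly 72 GB per card, which fits but leaves little headroom for the KV cache.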

Yusifu avatar Mar 26 '25 10:03 Yusifu

If you want to use vLLM with a 72B model for sampling, a single card will result in an OOM (out of memory) error, but two cards will work.

Hi, thank you for the clue, but I already tried giving vLLM 16 cards (and also reduced the max lengths). Is there any example script for the 72B model?

export ROLLOUT_TP_SIZE=16

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    ...
    actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \

jijivski avatar Apr 01 '25 13:04 jijivski

If you want to use vLLM with a 72B model for sampling, a single card will result in an OOM (out of memory) error, but two cards will work.

Hi, thank you for the clue, but I already tried giving vLLM 16 cards (and also reduced the max lengths). Is there any example script for the 72B model?

export ROLLOUT_TP_SIZE=16

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    ...
    actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \

Did you encounter OOM, or other errors? Cross-node TP rollout should be functional. Here's an example: https://github.com/volcengine/verl/commit/7646e08fca74183baa2790690456a2aa7568fb55
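
In case it helps, the cross-node setup roughly boils down to the world size (trainer.nnodes × trainer.n_gpus_per_node) being partitioned into rollout TP groups of size tensor_model_parallel_size; a minimal sketch, reusing the variable names from the script above:

export NNODES=2
export NPROC_PER_NODE=8    # 16 GPUs in total
export ROLLOUT_TP_SIZE=16  # one vLLM TP group spanning both nodes
# (ROLLOUT_TP_SIZE=8 would instead give two single-node TP groups)

python3 -m verl.trainer.main_ppo \
    ... \
    trainer.nnodes=$NNODES \
    trainer.n_gpus_per_node=$NPROC_PER_NODE \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
    ...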

eric-haibin-lin avatar Apr 01 '25 15:04 eric-haibin-lin

We provide some example scripts for models of different sizes here: https://verl.readthedocs.io/en/latest/perf/device_tuning.html

We are looking for contributors with more data points. Please submit a PR if you can provide data points for new model sizes.

eric-haibin-lin avatar Apr 11 '25 03:04 eric-haibin-lin

We provide some example scripts for models of different sizes here: https://verl.readthedocs.io/en/latest/perf/device_tuning.html

We are looking for contributors with more data points. Please submit a PR if you can provide data points for new model sizes.

Thanks for your reply! But the example script link you provided seems to be broken (404).


JiyuanAn avatar Apr 15 '25 08:04 JiyuanAn

I'm excited to share that GRPO has been successfully validated on a 2-node H20 cluster (8 GPUs per node, 16 GPUs total, 96 GB memory per GPU) using verl. A detailed implementation report with performance benchmarks and lessons learned will be shared soon!

@none0663 Is that report available yet? I am facing the same problem!

puppet101 avatar May 14 '25 02:05 puppet101