
Support FSDP worker and vLLM Ascend

Open sunyi0505 opened this issue 9 months ago • 6 comments

This PR is committed to support the Ascend NPU backend. Co-authored-by: Chendong98 [email protected] Co-authored-by: zheliuyu [email protected] Co-authored-by: celestialli [email protected]. In this PR, we add the capability to detect the NPU device type, and we also add a new script for training on NPU.

Here is the list of changes:

  1. pyproject.toml change the version of vllm
  2. requirements-npu.txt requirements for NPU
  3. verl/bert_padding.py Adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
  4. verl/single_controller/ray/base.py
  5. verl/third_party/vllm/vllm_spmd/dtensor_weight_loaders.py
  6. verl/trainer/fsdp_sft_trainer.py
  7. verl/utils/flops_counter.py
  8. verl/utils/fsdp_utils.py
  9. verl/workers/actor/dp_actor.py
  10. verl/workers/critic/dp_critic.py
  11. verl/workers/fsdp_workers.py
  12. verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py
  13. verl/workers/sharding_manager/fsdp_vllm.py
  14. verl/utils/device.py get the device type for different devices (see the sketch after this list)
  15. docs/ascend/ascend.md
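
Regarding item 14, below is a minimal sketch of how such a device-type helper could look. It is illustrative only; the function names are assumptions, not the actual contents of verl/utils/device.py in this PR.

import torch

def is_npu_available() -> bool:
    # torch_npu registers the NPU backend with torch when it can be imported
    try:
        import torch_npu  # noqa: F401
        return torch.npu.is_available()
    except ImportError:
        return False

def get_device_name() -> str:
    # Prefer NPU when present, then CUDA, otherwise fall back to CPU
    if is_npu_available():
        return "npu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"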

Here is our roadmap:

Roadmap

  • [x] SFT
  • [x] PPO
  • [x] GRPO

News

[2025.03.03] Modify the adaptation method of Ray

[2025.02.25] The PPO algorithm is supported for training on NPU with the FSDP backend.

[2025.02.23] The SFT algorithm is supported for training on NPU with the FSDP backend.

[2025.02.21] The GRPO algorithm is supported for training on NPU with the FSDP backend.

Requirements: We tested this PR on both Ascend NPU and GPU to ensure the same code can run on different devices. The hardware is 8 Atlas 800T A2 NPUs and 8 A100 GPUs. Other software information is shown in the following table.

Software Version
transformers 4.47.1
accelerate 1.3.0
torch_npu 2.5.1.rc1
CANN 8.1.RC1 (Not Released)

About mean error: Due to differences in hardware architecture, we cannot guarantee that the loss on Ascend NPU is exactly the same as that on GPU. In our experience, a loss difference of less than 2% is acceptable; if the difference is greater than 2%, we will try to fix it. The calculation formula is as follows.

[loss_comparison formula image]
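
A plausible form of this comparison, assuming the metric is the relative error of the mean loss over the N training steps (an assumed reconstruction, since the formula itself is only given in the loss_comparison image above):

\[
\text{mean error} \;=\; \frac{\left|\frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}^{\mathrm{NPU}}_{i} \;-\; \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}^{\mathrm{GPU}}_{i}\right|}{\frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}^{\mathrm{GPU}}_{i}} \times 100\%
\]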

N represents the number of training steps. For more information, please refer to the Calculation accuracy description.

sunyi0505 avatar Feb 21 '25 03:02 sunyi0505

Does this PR work on multiple nodes?

huangk10 avatar Feb 21 '25 06:02 huangk10

Does this PR work on multiple nodes?

I am currently testing on a single node only, and will follow up with multi-node testing results later.

sunyi0505 avatar Feb 21 '25 07:02 sunyi0505

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Feb 26 '25 00:02 CLAassistant

@as12138 Hello. Thanks for your efforts! Can this PR be run directly on the 910B2C 64GB now?

takagi97 avatar Mar 05 '25 07:03 takagi97

@as12138 Hello. Thanks for your efforts! Can this PR be run directly on the 910B2C 64GB now?

I tested it on the Atlas 800T A2 64G (Ascend + ARM) and it passed. If you're interested, you can verify it, and if you encounter any issues, please feel free to reach out to me.

sunyi0505 avatar Mar 05 '25 07:03 sunyi0505

@as12138 Hello. Thanks for your efforts! Can this PR be run directly on the 910B2C 64GB now?

I tested it on the Atlas 800T A2 64G (Ascend + ARM) and it passed. If you're interested, you can verify it, and if you encounter any issues, please feel free to reach out to me.

Thank you for your quick response! I will try it.

takagi97 avatar Mar 05 '25 09:03 takagi97

@eric-haibin-lin Can you review the PR?

sunyi0505 avatar Mar 13 '25 08:03 sunyi0505

Is CANN 8.1.RC1 (Not Released) mandatory? Have you tested PPO and GRPO on 8.0.RC3?

jianzhnie avatar Mar 25 '25 07:03 jianzhnie

Is CANN 8.1.RC1 (Not Released) mandatory? Have you tested PPO and GRPO on 8.0.RC3?

It is mandatory; I have not tested it on 8.0.RC3.

sunyi0505 avatar Mar 26 '25 01:03 sunyi0505

use_remove_padding is not supported on Ascend NPU at the moment.

sunyi0505 avatar Mar 27 '25 09:03 sunyi0505

Is CANN 8.1.RC1 released now?

It has not been released yet.

sunyi0505 avatar Mar 31 '25 02:03 sunyi0505

We have now tested the SFT and GRPO algorithms on Ascend NPU.

Due to differences in hardware architecture, we cannot guarantee that the loss on Ascend NPU is exactly the same as that on GPU. In our experience, a loss difference of less than 2% is acceptable; if it is greater than 2%, we will try to fix it. Likewise, a critic/rewards/mean difference of less than 4% is acceptable; if it is greater than 4%, we will try to fix it. The calculation formula is as follows.

[loss_comparison formula image]

N represents the number of training steps.

Software Version
transformers 4.49.0
torch_npu 2.5.1.rc1
CANN 8.1.RC1 (Not Released)

Here are the training scripts and loss comparison graphs.

For SFT:

# Tested with 1 & 8 NPUs

set -x

if [ "$#" -lt 2 ]; then
    echo "Usage: run_qwen_05_peft.sh <nproc_per_node> <save_path> [other_configs...]"
    exit 1
fi

nproc_per_node=$1
save_path=$2

# Shift the arguments so $@ refers to the rest
shift 2

torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
     -m verl.trainer.fsdp_sft_trainer \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.prompt_key=extra_info \
    data.response_key=extra_info \
    data.train_batch_size=512 \
    optim.lr=1e-4 \
    +data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=4 \
    model.partial_pretrain=Qwen/Qwen2.5-0.5B-Instruct \
    trainer.default_local_dir=$save_path \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-qwen-2.5-0.5b-instruct \
    trainer.logger=['console'] \
    trainer.total_epochs=1 \
    trainer.default_hdfs_dir=null $@ \
    model.lora_rank=32 \
    model.lora_alpha=16 \
    model.target_modules=all-linear

    # Or you can do this:
    # model.target_modules=[q_proj,v_proj] \

[SFT loss comparison graph]

For GRPO:

Changed parameters:

  • data.train_batch_size 1024 -> 16
  • actor_rollout_ref.actor.optim.lr 1e-6 -> 5e-7
  • critic.optim.lr 1e-5 -> 9e-6
  • actor_rollout_ref.actor.ppo_max_token_len_per_gpu 16384 -> 2048
  • actor_rollout_ref.model.use_remove_padding True -> False
  • actor_rollout_ref.actor.ppo_mini_batch_size 256 -> 64
  • actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu 80 -> 8
  • actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu 160 -> 80
  • actor_rollout_ref.rollout.tensor_model_parallel_size 2 -> 4
  • actor_rollout_ref.rollout.gpu_memory_utilization 0.6 -> 0.2
  • actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu 160 -> 80
  • actor_rollout_ref.rollout.enable_chunked_prefill True -> False
  • trainer.nnodes 1 -> 2

# Tested with 2 & 8 NPUs
set -x

export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=16 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=5e-7 \
    critic.optim.lr=9e-6 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=2048 \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=80 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=80 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen2_7b_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 $@

critic/rewards/mean comparison: [GRPO critic/rewards/mean graph]

sunyi0505 avatar Mar 31 '25 11:03 sunyi0505

Get error while saving checkpoint

Traceback (most recent call last):
  File "/third_party/verl/verl/trainer/main_ppo.py", line 55, in main
    run_ppo(config)
  File "/third_party/verl/verl/trainer/main_ppo.py", line 72, in run_ppo
    ray.get(main_task.remote(config))
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 2782, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 929, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::main_task() (pid=201983, ip=0.0.0.0)
  File "/third_party/verl/verl/trainer/main_ppo.py", line 172, in main_task
    trainer.fit()
  File "/third_party/verl/verl/trainer/ppo/ray_trainer.py", line 926, in fit
    self._save_checkpoint()
  File "/third_party/verl/verl/trainer/ppo/ray_trainer.py", line 669, in _save_checkpoint
    self.actor_rollout_wg.save_checkpoint(actor_local_path,
  File "/third_party/verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_save_checkpoint() (pid=202532, ip=0.0.0.0, actor_id=a42bf3481fb6488100622daa07000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0xffd0073ae1a0>)
  File "/third_party/verl/verl/single_controller/ray/base.py", line 429, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/third_party/verl/verl/single_controller/base/decorator.py", line 404, in inner
    return func(*args, **kwargs)
  File "/third_party/verl/verl/workers/fsdp_workers.py", line 604, in save_checkpoint
    self.checkpoint_manager.save_checkpoint(local_path=local_path,
  File "/third_party/verl/verl/utils/checkpoint/fsdp_checkpoint_manager.py", line 123, in save_checkpoint
    model_state_dict = self.model.state_dict()
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2219, in state_dict
    module.state_dict(
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2219, in state_dict
    module.state_dict(
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2219, in state_dict
    module.state_dict(
  [Previous line repeated 1 more time]
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2225, in state_dict
    hook_result = hook(self, destination, prefix, local_metadata)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 724, in _post_state_dict_hook
    processed_state_dict = _post_state_dict_hook_fn[fsdp_state._state_dict_type](
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 569, in _sharded_post_state_dict_hook
    return _common_unshard_post_state_dict_hook(
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 238, in _common_unshard_post_state_dict_hook
    param_hook(state_dict, prefix, fqn)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 566, in param_hook
    sharded_tensor = sharded_tensor.cpu()
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/tensor/_api.py", line 340, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dispatch.py", line 166, in dispatch
    op_info = self.unwrap_to_op_info(op_call, args, kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dispatch.py", line 371, in unwrap_to_op_info
    self._try_replicate_spec_for_scalar_tensor(op_call, arg, mesh)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dispatch.py", line 470, in _try_replicate_spec_for_scalar_tensor
    raise RuntimeError(
RuntimeError: aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

TaoTQ avatar Apr 10 '25 03:04 TaoTQ

Get error while saving checkpoint

Traceback (most recent call last):
  [...]
RuntimeError: aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

I think this issue is caused by torch.compile. To avoid this problem, you can add actor_rollout_ref.actor.use_torch_compile=False in the script. Thank you very much for your feedback.

sunyi0505 avatar Apr 10 '25 03:04 sunyi0505

Get error while saving checkpoint

Traceback (most recent call last):
  [...]
RuntimeError: aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

I think this issue is caused by torch.compile. To avoid this problem, you can add actor_rollout_ref.actor.use_torch_compile=False in the script. Thank you very much for your feedback.

This does not seem to be the root cause. I just tried. Issue still there

TaoTQ avatar Apr 10 '25 06:04 TaoTQ

Get error while saving checkpoint

Traceback (most recent call last):
  [...]
RuntimeError: aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

I think this issue is caused by torch.compile. To avoid this problem, you can add actor_rollout_ref.actor.use_torch_compile=False in the script. Thank you very much for your feedback.

This does not seem to be the root cause. I just tried. Issue still there

This is an adaptation issue on NPU. You can change offload_to_cpu to False on lines 92 and 93 of verl/utils/checkpoint/fsdp_checkpoint_manager.py to avoid this problem.

sunyi0505 avatar Apr 10 '25 09:04 sunyi0505

Get error while saving checkpoint

Traceback (most recent call last):
  [...]
RuntimeError: aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

I think this issue is caused by torch.compile. To avoid this problem, you can add actor_rollout_ref.actor.use_torch_compile=False in the script. Thank you very much for your feedback.

This does not seem to be the root cause. I just tried. Issue still there

This is an adaptation issue on NPU. You can change offload_to_cpu to False on lines 92 and 93 of verl/utils/checkpoint/fsdp_checkpoint_manager.py to avoid this problem.

You may have the line numbers slightly off, but I get what you mean. Here is the change I made:

    def save_checkpoint(self, local_path: str, global_step: int, remove_previous_ckpt=False, *args, **kwargs):
        # record the previous global step
        self.previous_global_step = global_step

        # remove previous local_path
        # TODO: shall we remove previous ckpt every save?
        if remove_previous_ckpt:
            self.remove_previous_save_local_path()
        local_path = self.local_mkdir(local_path)
        torch.distributed.barrier()

        # every rank will save its own model and optim shard
        state_dict_cfg = ShardedStateDictConfig(offload_to_cpu=False)  # Change it to False
        optim_cfg = ShardedOptimStateDictConfig(offload_to_cpu=False)  # Change it to False
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            with FSDP.state_dict_type(self.model, StateDictType.SHARDED_STATE_DICT, state_dict_cfg, optim_cfg):
                model_state_dict = self.model.state_dict()
                if self.optimizer is not None:
                    optimizer_state_dict = self.optimizer.state_dict()
                else:
                    optimizer_state_dict = None
                if self.lr_scheduler is not None:
                    lr_scheduler_state_dict = self.lr_scheduler.state_dict()
                else:
                    lr_scheduler_state_dict = None

                extra_state_dict = {
                    'lr_scheduler': lr_scheduler_state_dict,
                    'rng': self.get_rng_state(),
                }
                model_path = os.path.join(local_path, f'model_world_size_{self.world_size}_rank_{self.rank}.pt')
                optim_path = os.path.join(local_path, f'optim_world_size_{self.world_size}_rank_{self.rank}.pt')
                extra_path = os.path.join(local_path, f'extra_state_world_size_{self.world_size}_rank_{self.rank}.pt')

                print(f'[rank-{self.rank}]: Saving model to {os.path.abspath(model_path)}')
                print(f'[rank-{self.rank}]: Saving checkpoint to {os.path.abspath(model_path)}')
                print(f'[rank-{self.rank}]: Saving extra_state to {os.path.abspath(extra_path)}')
                torch.save(model_state_dict, model_path)
                torch.save(optimizer_state_dict, optim_path)  # TODO: address optimizer is None
                torch.save(extra_state_dict, extra_path)

        # wait for everyone to dump to local
        torch.distributed.barrier()

It finally works!

TaoTQ avatar Apr 10 '25 10:04 TaoTQ

Hi, I tried this branch with two configurations, 1*8 NPUs and 2*8 NPUs, running Qwen2-7B GRPO, and got an error; vLLM itself works fine. With 2*8 NPUs, it looks like all 8 processes end up on a single card.

torch 2.5.1
torch-npu 2.5.1.dev20250320
verl 0.2.0.dev0
vllm 0.7.1+empty
vllm_ascend 0.7.1rc2.dev0+gf17417f.d20250421
CANN 8.0.0

(TaskRunner pid=467613) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.ref_init_model() (pid=468350, ip=, actor_id=4e3d02723d52e239fe80fe0102000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0xfffc413aebf0>)
(TaskRunner pid=467613)   File "/tmp/ray/session_2025-04-24_16-12-17_465991_446436/runtime_resources/working_dir_files/_ray_pkg_b36dcd4c14bf8643/verl/single_controller/ray/base.py", line 429, in func
(TaskRunner pid=467613)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=467613)   File "/tmp/ray/session_2025-04-24_16-12-17_465991_446436/runtime_resources/working_dir_files/_ray_pkg_b36dcd4c14bf8643/verl/single_controller/base/decorator.py", line 404, in inner
(TaskRunner pid=467613)     return func(*args, **kwargs)
(TaskRunner pid=467613)   File "/tmp/ray/session_2025-04-24_16-12-17_465991_446436/runtime_resources/working_dir_files/_ray_pkg_b36dcd4c14bf8643/verl/workers/fsdp_workers.py", line 422, in init_model
(TaskRunner pid=467613)     self.ref_module_fsdp = self._build_model_optimizer(model_path=self.config.model.path,
(TaskRunner pid=467613)   File "/tmp/ray/session_2025-04-24_16-12-17_465991_446436/runtime_resources/working_dir_files/_ray_pkg_b36dcd4c14bf8643/verl/workers/fsdp_workers.py", line 230, in _build_model_optimizer
(TaskRunner pid=467613)     torch.distributed.barrier()
(TaskRunner pid=467613)   File "/home/ma-user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(TaskRunner pid=467613)     return func(*args, **kwargs)
(TaskRunner pid=467613)   File "/home/ma-user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
(TaskRunner pid=467613)     work = group.barrier(opts=opts)
(TaskRunner pid=467613) RuntimeError: create_config:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:102 HCCL function error: hcclCommInitRootInfoConfig(numRanks, &rootInfo, rank, config, &(comm->hcclComm)), error code is 7
(TaskRunner pid=467613) [ERROR] 2025-04-24-16:14:57 (PID:468350, Device:0, RankID:4) ERR02200 DIST call hccl api failed.
(TaskRunner pid=467613) EJ0001: [PID: 468350] 2025-04-24-16:14:57.113.722 Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
(TaskRunner pid=467613) Solution: Wait for 10s after killing the last training process and try again.

WenderMa avatar Apr 24 '25 08:04 WenderMa

Hi, I tried this branch with two configurations, 1*8 NPUs and 2*8 NPUs, running Qwen2-7B GRPO, and got an error; vLLM itself works fine. With 2*8 NPUs, it looks like all 8 processes end up on a single card. (Environment details and HCCL error log quoted above.)

We have only tested on CANN 8.1.RC1. Please upgrade to that version and run again; if you still run into problems, feel free to give us feedback.

sunyi0505 avatar Apr 25 '25 03:04 sunyi0505

transformers v4.51.4 starts to support enabling flash_attention_2 directly on Ascend NPU. It seems the transformers section of the README needs to be adjusted.
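
For illustration, a minimal sketch of what that could look like. The model name and dtype here are placeholders, and this is an assumption about usage rather than code from this PR.

import torch
import torch_npu  # noqa: F401  # registers the "npu" device with torch
from transformers import AutoModelForCausalLM

# With transformers >= 4.51.4, flash_attention_2 can be requested directly on Ascend NPU
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("npu")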

zheliuyu avatar May 16 '25 09:05 zheliuyu

transformers v4.51.4 starts to support enabling flash_attention_2 directly on Ascend NPU. It seems the transformers section of the README needs to be adjusted.

Thank you for your suggestion. I will make the necessary changes in the future.

sunyi0505 avatar May 16 '25 09:05 sunyi0505