
Weird training result after switching from Qwen2.5-7B-Instruct to Qwen3-8B: accuracy is 0 with nonsense output

Open Raphrain opened this issue 7 months ago • 1 comment

I am using the latest verl, and here is my training script:

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=/raphealhuang/simpleRL/train.parquet \
data.val_files=/raphealhuang/simpleRL/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=/raphealhuang/models/Qwen3-8B \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=384 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=12 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.0001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=80 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=6 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=80 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
custom_reward_function.path=/raphealhuang/verl/verl/utils/reward_score/simplelr_qwen.py \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_train' \
trainer.experiment_name='grpo_qwen3_8b' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=5 \
trainer.test_freq=5 \
trainer.resume_mode='auto' \
trainer.default_local_dir=/raphealhuang/models/checkpoints/${PROJECT_NAME}/${EXPERIMENT_NAME} \
trainer.max_actor_ckpt_to_keep=2 \
trainer.max_critic_ckpt_to_keep=2 \
trainer.total_epochs=100 $@
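
For reference, the custom_reward_function.path above points at a rule-based scorer. A minimal sketch of the shape verl expects such a file to have (illustrative only, with a made-up exact-match rule; this is not the actual contents of simplelr_qwen.py, and the entry point defaults to a function named compute_score):

# Illustrative sketch of a verl custom reward function; the real
# simplelr_qwen.py uses its own answer extraction and scoring logic.
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # Take the last non-empty line of the response as the candidate answer
    # (placeholder for a proper answer parser).
    lines = [line.strip() for line in solution_str.strip().splitlines() if line.strip()]
    answer = lines[-1] if lines else ""
    # Reward 1.0 for an exact match with the ground truth, otherwise 0.0.
    return 1.0 if answer == str(ground_truth).strip() else 0.0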

The validation generations from Qwen3 are nonsense and the accuracy is always 0.


I have compared my script against the provided example, but found nothing different that would explain the training failure:

examples/grpo_trainer/run_qwen3-8b.sh

set -x

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen3-8B \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=32 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen3_8b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
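
To narrow down whether the rollout engine itself produces sane text for Qwen3-8B, independent of the trainer, a quick standalone vLLM generation can help; a minimal sketch (prompt and sampling settings are placeholders, and a recent vLLM build with Qwen3 support is assumed):

# Standalone vLLM generation check, outside of verl.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # or a local path such as /raphealhuang/models/Qwen3-8B
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is 2 + 2? Answer briefly."], params)
print(outputs[0].outputs[0].text)  # should be readable text, not garbage tokens

If this already prints garbage, the problem is in the inference stack rather than in the GRPO configuration.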

Raphrain avatar May 21 '25 05:05 Raphrain

same problem

0205090923 avatar May 26 '25 16:05 0205090923

Has anybody solved this problem?

tjoymeed avatar Jun 08 '25 23:06 tjoymeed

I just ran into this problem; try updating your vLLM.

The minimum vLLM version for Qwen3 is 0.8.4. I was running 0.8.2 and got this same output; it fixed itself after updating to 0.8.4. Hope this helps!
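
You can confirm which version you are actually running with a quick check (sketch, assuming vLLM is importable in the training environment):

# Print the installed vLLM version; this thread suggests >= 0.8.4 for Qwen3.
import vllm
print(vllm.__version__)

Upgrading is typically just pip install -U "vllm>=0.8.4" in a pip-managed environment.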

maxx06 avatar Jul 10 '25 07:07 maxx06