verl Actor grad Norms are always NaN (GRPO Training)

I tried to run the GRPO training example and found that the actor grad norms were always NaN.

Apr 22 '25 05:04 AlanShao-zy

GPUs:

Driver Version: 550.127.08     
CUDA Version: 12.4  
GPUs: 4 *  NVIDIA A10

Deps:

Python 3.11.11 

name         : transformers                                                                                                                                                                                                                                      
version      : 4.51.1                                                                                                                                                                                                                          
description  : State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow  

 name         : torch                            
 version      : 2.6.0                                                                                  
 description  : Tensors and Dynamic neural networks in Python with strong GPU acceleration   

 name         : vllm                         
 version      : 0.8.4                                                                                                                    
 description  : A high-throughput and memory-efficient inference and serving engine for LLMs 

 name         : ray                                                                                                                                                                                                                             
 version      : 2.43.0                                                                           
 description  : Ray provides a simple, universal API for building distributed applications

Name: flash_attn                                                                                                                                                                     
Version: 2.7.4.post1                                                                                                                                                                 
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention  

 name         : verl                                                                                                                                                                                                                                        
 version      : 0.2.0.dev0                                                                                                                                                                                                                            
 description  : verl: Volcano Engine Reinforcement Learning for LLM

Recipe:

set -x

MODEL_PATH=/nas/llm_weights/qwen2.5-7b-it
DATA_PATH=/nas/llm_weights/datasets/gsm8k-verl
RF_PATH=3rdparty/verl/verl/utils/reward_score/gsm8k.py
RF_NAME=compute_score
MIN_BS_PER_GPU=16
MIN_BS_LOGPROB_PER_GPU=32
PPO_MINI_BS=16
DATA_TRAIN_BS=64
N=5
TP=4
GMU=0.7
EPOCH=15
STEPS=1000
LR=4e-7
MAX_REP_LEN=1024
DATASET=gsm8k_$(date +%Y%m%d_%H%M%S)
DY_BS=false


if [ "$DY_BS" = "false" ]; then
    poetry run python -m verl.trainer.main_ppo \
        algorithm.adv_estimator=grpo \
        data.train_files=$DATA_PATH/train.parquet   \
        data.val_files=$DATA_PATH/test.parquet  \
        data.train_batch_size=$DATA_TRAIN_BS \
        data.max_prompt_length=512 \
        data.max_response_length=$MAX_REP_LEN \
        data.filter_overlong_prompts=True \
        data.truncation='error' \
        custom_reward_function.path=$RF_PATH\
        custom_reward_function.name=$RF_NAME\
        actor_rollout_ref.model.path=$MODEL_PATH \
        actor_rollout_ref.actor.optim.lr=$LR \
        actor_rollout_ref.actor.grad_clip=1.0 \
        actor_rollout_ref.model.use_remove_padding=True \
        actor_rollout_ref.actor.ppo_mini_batch_size=$PPO_MINI_BS \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$MIN_BS_PER_GPU\
        actor_rollout_ref.actor.use_kl_loss=True \
        actor_rollout_ref.actor.kl_loss_coef=0.001 \
        actor_rollout_ref.actor.kl_loss_type=low_var_kl \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.fsdp_config.param_offload=True \
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=$MIN_BS_LOGPROB_PER_GPU \
        actor_rollout_ref.rollout.tensor_model_parallel_size=$TP \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.gpu_memory_utilization=$GMU \
        actor_rollout_ref.rollout.n=$N \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=$MIN_BS_LOGPROB_PER_GPU \
        actor_rollout_ref.ref.fsdp_config.param_offload=True \
        actor_rollout_ref.rollout.enforce_eager=False \
        actor_rollout_ref.rollout.free_cache_engine=False \
        algorithm.use_kl_in_reward=False \
        trainer.critic_warmup=0 \
        trainer.logger=['console','wandb'] \
        trainer.project_name=$PROJECT_NAME \
        trainer.experiment_name=$EXPERIMENT_NAME \
        trainer.n_gpus_per_node=8 \
        trainer.nnodes=1 \
        trainer.default_hdfs_dir=null \
        trainer.default_local_dir=checkpoints/grpo/$DATASET \
        trainer.save_freq=-1 \
        trainer.test_freq=10 \
        trainer.total_epochs=$EPOCH 
else
    echo err
fi

Apr 22 '25 05:04 AlanShao-zy

I am facing the same problem. The grad_norm is na after 50 steps.