
Actor grad norms are always NaN (GRPO training)

Open · AlanShao-zy opened this issue 8 months ago

I tried to run the GRPO training example and found that the actor grad norms were always NaN.

AlanShao-zy · Apr 22 '25

GPUs:

Driver Version: 550.127.08
CUDA Version: 12.4
GPUs: 4 x NVIDIA A10

Deps:

Python 3.11.11

transformers 4.51.1 (State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow)
torch 2.6.0 (Tensors and Dynamic neural networks in Python with strong GPU acceleration)
vllm 0.8.4 (A high-throughput and memory-efficient inference and serving engine for LLMs)
ray 2.43.0 (Ray provides a simple, universal API for building distributed applications)
flash_attn 2.7.4.post1 (Flash Attention: Fast and Memory-Efficient Exact Attention)
verl 0.2.0.dev0 (verl: Volcano Engine Reinforcement Learning for LLM)

Recipe:

set -x

MODEL_PATH=/nas/llm_weights/qwen2.5-7b-it
DATA_PATH=/nas/llm_weights/datasets/gsm8k-verl
RF_PATH=3rdparty/verl/verl/utils/reward_score/gsm8k.py
RF_NAME=compute_score
MIN_BS_PER_GPU=16
MIN_BS_LOGPROB_PER_GPU=32
PPO_MINI_BS=16
DATA_TRAIN_BS=64
N=5
TP=4
GMU=0.7
EPOCH=15
STEPS=1000
LR=4e-7
MAX_REP_LEN=1024
DATASET=gsm8k_$(date +%Y%m%d_%H%M%S)
DY_BS=false
PROJECT_NAME=verl_grpo_example   # placeholder wandb project name; set to your own
EXPERIMENT_NAME=$DATASET         # placeholder experiment name for the wandb logger


if [ "$DY_BS" = "false" ]; then
    poetry run python -m verl.trainer.main_ppo \
        algorithm.adv_estimator=grpo \
        data.train_files=$DATA_PATH/train.parquet   \
        data.val_files=$DATA_PATH/test.parquet  \
        data.train_batch_size=$DATA_TRAIN_BS \
        data.max_prompt_length=512 \
        data.max_response_length=$MAX_REP_LEN \
        data.filter_overlong_prompts=True \
        data.truncation='error' \
        custom_reward_function.path=$RF_PATH \
        custom_reward_function.name=$RF_NAME \
        actor_rollout_ref.model.path=$MODEL_PATH \
        actor_rollout_ref.actor.optim.lr=$LR \
        actor_rollout_ref.actor.grad_clip=1.0 \
        actor_rollout_ref.model.use_remove_padding=True \
        actor_rollout_ref.actor.ppo_mini_batch_size=$PPO_MINI_BS \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$MIN_BS_PER_GPU \
        actor_rollout_ref.actor.use_kl_loss=True \
        actor_rollout_ref.actor.kl_loss_coef=0.001 \
        actor_rollout_ref.actor.kl_loss_type=low_var_kl \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.fsdp_config.param_offload=True \
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=$MIN_BS_LOGPROB_PER_GPU \
        actor_rollout_ref.rollout.tensor_model_parallel_size=$TP \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.gpu_memory_utilization=$GMU \
        actor_rollout_ref.rollout.n=$N \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=$MIN_BS_LOGPROB_PER_GPU \
        actor_rollout_ref.ref.fsdp_config.param_offload=True \
        actor_rollout_ref.rollout.enforce_eager=False \
        actor_rollout_ref.rollout.free_cache_engine=False \
        algorithm.use_kl_in_reward=False \
        trainer.critic_warmup=0 \
        trainer.logger=['console','wandb'] \
        trainer.project_name=$PROJECT_NAME \
        trainer.experiment_name=$EXPERIMENT_NAME \
        trainer.n_gpus_per_node=4 \
        trainer.nnodes=1 \
        trainer.default_hdfs_dir=null \
        trainer.default_local_dir=checkpoints/grpo/$DATASET \
        trainer.save_freq=-1 \
        trainer.test_freq=10 \
        trainer.total_epochs=$EPOCH 
else
    echo "DY_BS=true branch is not implemented in this script" >&2
fi
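
For anyone debugging this, here is a minimal diagnostic sketch, assuming a plain PyTorch loop (not verl's internals; note that under FSDP each rank only sees its local gradient shard, so the reported norm is per-rank). Call it after loss.backward() and before clipping to pin down the first step, and the parameters, where gradients go non-finite:

import torch

def grad_norm_report(model: torch.nn.Module, step: int) -> float:
    # Global L2 norm over all parameter gradients, computed without clipping.
    norms = [p.grad.norm(2) for p in model.parameters() if p.grad is not None]
    total = torch.norm(torch.stack(norms), 2)
    if not torch.isfinite(total):
        # Name the offending parameters so the NaN can be traced to a module.
        bad = [name for name, p in model.named_parameters()
               if p.grad is not None and not torch.isfinite(p.grad).all()]
        raise RuntimeError(f"step {step}: grad norm is {total.item()}, "
                           f"non-finite grads in: {bad}")
    return total.item()

Knowing whether the NaN first appears in the loss, the log-probs, or only in the gradients narrows the search considerably.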

AlanShao-zy · Apr 22 '25

I am facing the same problem. The grad_norm is NaN after 50 steps.

ChaseChenNLP · Apr 22 '25

DATA_TRAIN_BS is too small. Try increasing it to 1024. We will investigate why this happens when the batch size is small (one possible mechanism is sketched after this comment).

vermouth1992 · Apr 22 '25
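
For context, one commonly cited mechanism, though not confirmed as the cause in this thread: GRPO normalizes each rollout's advantage by the reward statistics of its group of n samples, and with a small batch it is easy for all n rollouts of a prompt to receive the same reward, making the group std zero; an unguarded division then yields NaN/Inf advantages that poison the gradients. A minimal sketch of group-normalized advantages with an epsilon guard (illustrative only, not verl's actual code):

import torch

def grpo_advantages(rewards: torch.Tensor, n: int, eps: float = 1e-6) -> torch.Tensor:
    # rewards: flat tensor of scalar rewards, n consecutive rollouts per prompt.
    groups = rewards.view(-1, n)                  # (num_prompts, n)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    # Without eps, a group of identical rewards (std == 0) turns every
    # advantage in that group into NaN.
    return ((groups - mean) / (std + eps)).view(-1)

For example, grpo_advantages(torch.ones(5), n=5) returns zeros rather than NaN, which is exactly the all-correct-group case that rollout.n=5 on GSM8K's binary reward can produce.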

> DATA_TRAIN_BS is too small. Try increasing it to 1024. We will investigate why this happens when the batch size is small.

This doesn't work for me on my custom dataset.

zihaolucky · May 11 '25

Any updates on this issue? I am facing the same problem when training GRPO on Qwen2-VL.

wangskyGit · May 26 '25

Any updates? Same issue, hitting NaN after about 100 steps of training.

Juanerx · Jun 07 '25

Could you please try this solution? https://github.com/volcengine/verl/pull/1779

vermouth1992 · Jun 07 '25
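
Independent of what that PR changes, a generic stopgap pattern until a fixed release is available (plain PyTorch, not a verl option) is to skip the optimizer step whenever the grad norm comes back non-finite, so one bad batch does not corrupt the weights:

import torch

def guarded_step(model, optimizer, max_norm: float = 1.0) -> bool:
    # clip_grad_norm_ returns the pre-clip global norm, NaN/Inf included.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if torch.isfinite(grad_norm):
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        return True
    # Drop this update: clear the poisoned gradients and move on.
    optimizer.zero_grad(set_to_none=True)
    return False

Here max_norm=1.0 mirrors the grad_clip=1.0 setting in the recipe above.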

> I am facing the same problem. The grad_norm is NaN after 50 steps.

Same here.

Vilonge · Jul 04 '25

Same after 80 steps of multi-turn GRPO training on Qwen2.5-Instruct-3B.

threegold116 · Jul 25 '25

> Same after 80 steps of multi-turn GRPO training on Qwen2.5-Instruct-3B.

After trying warmup=0.285, the problem is alleviated (a sketch of what that ratio means follows below).

threegold116 · Jul 31 '25
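
For readers trying the same mitigation: "warmup=0.285" presumably refers to a learning-rate warmup ratio, i.e. the LR ramps linearly from zero over the first 28.5% of training steps, keeping the early, high-variance GRPO updates small. A minimal sketch of that schedule in plain PyTorch (the corresponding verl config key may differ):

import torch

def linear_warmup(optimizer, total_steps: int, warmup_ratio: float = 0.285):
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    # Multiplier ramps 0 -> 1 over warmup_steps, then holds at 1.
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
    )

With the recipe's STEPS=1000 this would stretch the ramp over roughly the first 285 steps, past the 50-100 step mark where the NaNs are reported to appear.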