Actor grad norms are always NaN (GRPO training)
I tried to run the GRPO training example and found that the actor grad norms were always NaN.
GPUs:
Driver Version: 550.127.08
CUDA Version: 12.4
GPUs: 4 * NVIDIA A10
Deps:
Python 3.11.11
transformers 4.51.1 (State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow)
torch 2.6.0 (Tensors and Dynamic neural networks in Python with strong GPU acceleration)
vllm 0.8.4 (A high-throughput and memory-efficient inference and serving engine for LLMs)
ray 2.43.0 (Ray provides a simple, universal API for building distributed applications)
flash_attn 2.7.4.post1 (Flash Attention: Fast and Memory-Efficient Exact Attention)
verl 0.2.0.dev0 (verl: Volcano Engine Reinforcement Learning for LLM)
Recipe:
set -x
MODEL_PATH=/nas/llm_weights/qwen2.5-7b-it
DATA_PATH=/nas/llm_weights/datasets/gsm8k-verl
RF_PATH=3rdparty/verl/verl/utils/reward_score/gsm8k.py
RF_NAME=compute_score
MIN_BS_PER_GPU=16
MIN_BS_LOGPROB_PER_GPU=32
PPO_MINI_BS=16
DATA_TRAIN_BS=64
N=5
TP=4
GMU=0.7
EPOCH=15
STEPS=1000
LR=4e-7
MAX_REP_LEN=1024
DATASET=gsm8k_$(date +%Y%m%d_%H%M%S)
DY_BS=false
if [ "$DY_BS" = "false" ]; then
poetry run python -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_PATH/train.parquet \
data.val_files=$DATA_PATH/test.parquet \
data.train_batch_size=$DATA_TRAIN_BS \
data.max_prompt_length=512 \
data.max_response_length=$MAX_REP_LEN \
data.filter_overlong_prompts=True \
data.truncation='error' \
custom_reward_function.path=$RF_PATH \
custom_reward_function.name=$RF_NAME \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=$LR \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=$PPO_MINI_BS \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$MIN_BS_PER_GPU \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=$MIN_BS_LOGPROB_PER_GPU \
actor_rollout_ref.rollout.tensor_model_parallel_size=$TP \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=$GMU \
actor_rollout_ref.rollout.n=$N \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=$MIN_BS_LOGPROB_PER_GPU \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.free_cache_engine=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name=$PROJECT_NAME \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.default_hdfs_dir=null \
trainer.default_local_dir=checkpoints/grpo/$DATASET \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=$EPOCH
else
echo err
fi
I am facing the same problem; the grad_norm becomes NaN after 50 steps.
The DATA_TRAIN_BS is too small. Try increasing it to 1024. We will investigate why this happens when the batch size is small.
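For concreteness, a minimal sketch of that suggestion applied to the recipe above: only the rollout batch size changes, and OTHER_OVERRIDES is a hypothetical placeholder array standing in for the remaining key=value pairs copied verbatim from the original script.

# Sketch: relaunch the recipe above with a larger rollout batch, as suggested.
# OTHER_OVERRIDES is a hypothetical placeholder for the rest of the overrides
# (data paths, model path, KL settings, trainer settings) from the original script.
DATA_TRAIN_BS=1024   # was 64 in the original recipe
poetry run python -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_batch_size=$DATA_TRAIN_BS \
"${OTHER_OVERRIDES[@]}"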
This doesn't work for me on my custom dataset.
Any updates on this issue? I am facing the same problem when training GRPO on Qwen2-VL.
Any updates? Same issue, appearing after about 100 training steps.
Could you please try this solution? https://github.com/volcengine/verl/pull/1779
I am facing the same problem; the grad_norm becomes NaN after 50 steps.
same
Same after 80 steps for multi-turn GRPO training on qwen2.5-instruct-3b.
After trying warmup=0.285, the problem is alleviated.
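If "warmup" here refers to learning-rate warmup on the actor optimizer, the override would look roughly like the sketch below. The key name lr_warmup_steps_ratio is an assumption based on verl's actor optim config and should be checked against the installed version; OTHER_OVERRIDES is again a hypothetical placeholder for the rest of the recipe's overrides.

# Sketch, assuming "warmup" means the actor optimizer's LR warmup ratio in verl.
# OTHER_OVERRIDES is a hypothetical placeholder for the remaining overrides
# from the original recipe.
poetry run python -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
actor_rollout_ref.actor.optim.lr=4e-7 \
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.285 \
"${OTHER_OVERRIDES[@]}"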