
[Potential Bug] KL compute in `low_var_kl` (causes KL NaN and !!!!!!!!!!!!!!! output)

Open • yushuiwx opened this issue 8 months ago • 14 comments

Hi,

A few days ago, I reported this issue in discussion #751, and now I have some updates.

Similar issues:

  • https://github.com/volcengine/verl/issues/747
  • https://github.com/volcengine/verl/issues/721
  • .......

After careful analysis, I found that the issue lies in the calculation of low_var_kl. In the problematic step, the loss has normal values (no NaN or Inf), but during backpropagation the exp operation in ratio = torch.exp(kl) within low_var_kl (https://github.com/volcengine/verl/blob/8cae42dc29736d0802ded43c5ecf67a809d56bd8/verl/trainer/ppo/core_algos.py#L386) produces extreme values, leading to NaN in kl. My temporary solution is to use torch.clamp(kl, min=-5, max=5) to constrain kl before computing ratio. This approach works, but I feel uncertain about it. Could you kindly help confirm whether this is an appropriate solution?

Originally posted by @yushuiwx in #751
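For reference, a minimal sketch of the low-variance (k3) estimator with the clamp applied before the exp. This is illustrative only; the tensor names and the final-clamp bounds below are assumptions, not the exact verl code:

```python
import torch

def low_var_kl(logprob: torch.Tensor, ref_logprob: torch.Tensor) -> torch.Tensor:
    """k3 estimator: E[exp(q - p) - (q - p) - 1], with q = ref log-prob, p = policy log-prob."""
    kl = ref_logprob - logprob
    # Clamp BEFORE exp(), so an extreme log-prob gap cannot blow up the forward
    # value or the gradient of exp() during backpropagation.
    kl = torch.clamp(kl, min=-5.0, max=5.0)
    ratio = torch.exp(kl)
    kld = ratio - kl - 1.0
    # The final estimate is also clamped in verl; the bounds here are placeholders.
    return torch.clamp(kld, min=-10.0, max=10.0)

# One token with a huge gap between policy and reference log-probs:
logprob = torch.tensor([-45.0, -1.2, -0.7], requires_grad=True)
ref_logprob = torch.tensor([-5.0, -1.0, -0.9])
low_var_kl(logprob, ref_logprob).sum().backward()
print(logprob.grad)  # stays finite because the log-ratio was clamped first
```

Whether clamping the log-ratio here, rather than only the final kld, is the right place is exactly the question in this issue.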

yushuiwx avatar Apr 03 '25 08:04 yushuiwx

Doesn't the final clamp take effect?

vermouth1992 avatar Apr 04 '25 01:04 vermouth1992

For me, I always hit the same !!! problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved this by referring to https://github.com/volcengine/verl/issues/405. Hope this helps somebody. Besides, this points to another issue: the assertion that was added won't actually run, since we don't use "ppo_micro_batch_size" anymore, so the assert is skipped and nobody notices.

Image
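As a rough illustration of a check that would not be silently skipped, here is a sketch; the config keys come from this thread, but the divisibility constraint and the helper itself are assumptions for illustration, not verl's actual assertion:

```python
def check_micro_batch(cfg: dict, n_gpus: int) -> None:
    """Validate micro-batch settings whether the legacy or the per-GPU key is used."""
    mini = cfg["ppo_mini_batch_size"]
    micro = cfg.get("ppo_micro_batch_size")                  # legacy, global
    micro_per_gpu = cfg.get("ppo_micro_batch_size_per_gpu")  # current, per GPU

    if micro is None and micro_per_gpu is None:
        raise ValueError("set ppo_micro_batch_size or ppo_micro_batch_size_per_gpu")

    # Normalize to a global micro batch so the check runs in both cases instead
    # of being skipped when only the per-GPU key is present.
    effective_micro = micro if micro is not None else micro_per_gpu * n_gpus
    if mini % effective_micro != 0:
        raise ValueError(
            f"ppo_mini_batch_size={mini} is not divisible by the effective "
            f"micro batch size {effective_micro}"
        )

# e.g. the settings quoted later in this thread:
check_micro_batch({"ppo_mini_batch_size": 8, "ppo_micro_batch_size_per_gpu": 2}, n_gpus=2)
```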

great-luao avatar Apr 06 '25 11:04 great-luao

For me, I always hit the same !!! problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved this by referring to #405. Hope this helps somebody. Besides, this points to another issue: the assertion that was added won't actually run, since we don't use "ppo_micro_batch_size" anymore, so the assert is skipped and nobody notices.

Image

How did you fix it?
2 GPUs; my settings are:
actor_rollout_ref.actor.ppo_mini_batch_size=8 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \

chuangzhidan avatar Apr 10 '25 05:04 chuangzhidan

I used torch.clamp(kl, min=-5, max=5); it didn't work, sadly.

chuangzhidan avatar Apr 10 '25 05:04 chuangzhidan

Still not working.

Xia723 avatar Apr 10 '25 06:04 Xia723

For me, I always hit the same !!! problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved this by referring to #405. Hope this helps somebody. Besides, this points to another issue: the assertion that was added won't actually run, since we don't use "ppo_micro_batch_size" anymore, so the assert is skipped and nobody notices. Image

How did you fix it? 2 GPUs; my settings are: actor_rollout_ref.actor.ppo_mini_batch_size=8 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \

Check the output during training: do you also get grad_norm = nan after the first step? If so, I think it's better to check whether the loss is calculated correctly. If not, then I guess we are not in the same case.

great-luao avatar Apr 10 '25 10:04 great-luao

Thanks for the solution, but unfortunately in my case it seems to address a consequence rather than the root cause. Even when kl is constrained and no longer produces nan, the training still fails.

water-vapor avatar Apr 14 '25 16:04 water-vapor

I turned off use_kl_loss and it's still not working, so I switched to OpenRLHF.

chuangzhidan avatar Apr 15 '25 03:04 chuangzhidan

Thanks for sharing! I had the same issue and your torch.clamp fix worked for me too.

DEM1TASSE avatar Apr 15 '25 08:04 DEM1TASSE

For me, I always hit the same !!! problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved this by referring to #405. Hope this helps somebody. Besides, this points to another issue: the assertion that was added won't actually run, since we don't use "ppo_micro_batch_size" anymore, so the assert is skipped and nobody notices.

Image

How did you solve it exactly again? Thanks!

lynnliu030 avatar Apr 23 '25 18:04 lynnliu030

I also ran into this problem with vLLM 0.8.2 on 2 nodes. After someone pointed it out, I disabled vLLM's V1 engine and everything went back to normal. Concretely: set export VLLM_USE_V1=0, and do not set
actor_rollout_ref.rollout.enforce_eager=False
actor_rollout_ref.rollout.free_cache_engine=False
in the training arguments.

takagi97 avatar May 07 '25 11:05 takagi97

I also ran into this problem with vLLM 0.8.2 on 2 nodes. After someone pointed it out, I disabled vLLM's V1 engine and everything went back to normal. Concretely: set export VLLM_USE_V1=0, and do not set actor_rollout_ref.rollout.enforce_eager=False or actor_rollout_ref.rollout.free_cache_engine=False in the training arguments.

What problem does this actually fix? In my case, training runs until it plateaus and then grad_norm becomes nan.

takfate avatar May 07 '25 13:05 takfate

I also ran into this problem with vLLM 0.8.2 on 2 nodes. After someone pointed it out, I disabled vLLM's V1 engine and everything went back to normal. Concretely: set export VLLM_USE_V1=0, and do not set actor_rollout_ref.rollout.enforce_eager=False or actor_rollout_ref.rollout.free_cache_engine=False in the training arguments.

What problem does this actually fix? In my case, training runs until it plateaus and then grad_norm becomes nan.

I haven't looked into it carefully. I've seen other issues discussing this kind of sudden training blow-up: one cause is that the probabilities of the sampled tokens computed under the current policy and the old policy differ too much, so the computed ratio explodes numerically. I'm not sure whether that's the cause here, but on my side the V0 engine works and V1 doesn't, so I suspect a V1 bug may be triggering the problem above.
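A toy numerical illustration of that mechanism (the log-prob values are made up, just to show how the ratio overflows and why clamping the log-ratio helps):

```python
import torch

logp_old = torch.tensor([-2.0, -3.0, -1.5])   # log-probs under the rollout (old) policy
logp_new = torch.tensor([-2.1, -3.2, -90.0])  # current policy; one token drifted far away

log_ratio = logp_new - logp_old
print(torch.exp(log_ratio))          # third entry underflows toward zero
print(torch.exp(-log_ratio).half())  # the opposite direction overflows to inf in fp16

# Clamping the log-ratio before exponentiating keeps the result finite,
# which is what the torch.clamp(kl, ...) workaround earlier in this thread does.
print(torch.exp(torch.clamp(-log_ratio, max=20.0)))
```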

takagi97 avatar May 14 '25 12:05 takagi97

Marking this; seems to be a common issue.

141forever avatar May 30 '25 13:05 141forever

I also ran into this problem with vLLM 0.8.2 on 2 nodes. After someone pointed it out, I disabled vLLM's V1 engine and everything went back to normal. Concretely: set export VLLM_USE_V1=0, and do not set actor_rollout_ref.rollout.enforce_eager=False or actor_rollout_ref.rollout.free_cache_engine=False in the training arguments.

On my side, with 4 × 24 GB GPUs, just setting actor_rollout_ref.rollout.enforce_eager=False and actor_rollout_ref.rollout.free_cache_engine=False was enough to make it work.

HollrayChan avatar Aug 28 '25 03:08 HollrayChan