
[Potential Bug] KL compute in `low_var_kl` (causes KL NaN and !!!!!!!!!!!!!!! output)

Open • yushuiwx opened this issue 8 months ago • 14 comments

Hi,

A few days ago, I reported this issue in discussion #751, and now I have some updates.

Similar issues:

  • https://github.com/volcengine/verl/issues/747
  • https://github.com/volcengine/verl/issues/721
  • .......

After careful analysis, I found that the issue lies in the calculation of low_var_kl. In the problematic step, the loss has normal values (no NaN or Inf), but during backpropagation the exp operation in ratio = torch.exp(kl) within low_var_kl (https://github.com/volcengine/verl/blob/8cae42dc29736d0802ded43c5ecf67a809d56bd8/verl/trainer/ppo/core_algos.py#L386) produces extreme values, leading to NaN in kl. My temporary solution is to use torch.clamp(kl, min=-5, max=5) to constrain kl before computing ratio. This approach works, but I feel uncertain about it. Could you kindly help confirm whether this is an appropriate solution?

Originally posted by @yushuiwx in #751
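For reference, a minimal sketch of the low-variance (k3) estimator with the clamp applied before the exp. This is illustrative only; the tensor names and the final-clamp bounds below are assumptions, not the exact verl code:

```python
import torch

def low_var_kl(logprob: torch.Tensor, ref_logprob: torch.Tensor) -> torch.Tensor:
    """k3 estimator: E[exp(q - p) - (q - p) - 1], with q = ref log-prob, p = policy log-prob."""
    kl = ref_logprob - logprob
    # Clamp BEFORE exp(), so an extreme log-prob gap cannot blow up the forward
    # value or the gradient of exp() during backpropagation.
    kl = torch.clamp(kl, min=-5.0, max=5.0)
    ratio = torch.exp(kl)
    kld = ratio - kl - 1.0
    # The final estimate is also clamped in verl; the bounds here are placeholders.
    return torch.clamp(kld, min=-10.0, max=10.0)

# One token with a huge gap between policy and reference log-probs:
logprob = torch.tensor([-45.0, -1.2, -0.7], requires_grad=True)
ref_logprob = torch.tensor([-5.0, -1.0, -0.9])
low_var_kl(logprob, ref_logprob).sum().backward()
print(logprob.grad)  # stays finite because the log-ratio was clamped first
```

Whether clamping the log-ratio here, rather than only the final kld, is the right place is exactly the question in this issue.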

yushuiwx avatar Apr 03 '25 08:04 yushuiwx

Doesn't the final clamp take effect?

vermouth1992 avatar Apr 04 '25 01:04 vermouth1992

For me, I always hit the same !!! problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved this by referring to https://github.com/volcengine/verl/issues/405. Hope this helps somebody. Besides, this points to another issue: the assertion that was added won't actually run, since we don't use "ppo_micro_batch_size" anymore, so the assert is skipped and nobody notices.

Image
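As a rough illustration of a check that would not be silently skipped, here is a sketch; the config keys come from this thread, but the divisibility constraint and the helper itself are assumptions for illustration, not verl's actual assertion:

```python
def check_micro_batch(cfg: dict, n_gpus: int) -> None:
    """Validate micro-batch settings whether the legacy or the per-GPU key is used."""
    mini = cfg["ppo_mini_batch_size"]
    micro = cfg.get("ppo_micro_batch_size")                  # legacy, global
    micro_per_gpu = cfg.get("ppo_micro_batch_size_per_gpu")  # current, per GPU

    if micro is None and micro_per_gpu is None:
        raise ValueError("set ppo_micro_batch_size or ppo_micro_batch_size_per_gpu")

    # Normalize to a global micro batch so the check runs in both cases instead
    # of being skipped when only the per-GPU key is present.
    effective_micro = micro if micro is not None else micro_per_gpu * n_gpus
    if mini % effective_micro != 0:
        raise ValueError(
            f"ppo_mini_batch_size={mini} is not divisible by the effective "
            f"micro batch size {effective_micro}"
        )

# e.g. the settings quoted later in this thread:
check_micro_batch({"ppo_mini_batch_size": 8, "ppo_micro_batch_size_per_gpu": 2}, n_gpus=2)
```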

great-luao avatar Apr 06 '25 11:04 great-luao

For me, I always hit the same !!! problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved this by referring to #405. Hope this helps somebody. Besides, this points to another issue: the assertion that was added won't actually run, since we don't use "ppo_micro_batch_size" anymore, so the assert is skipped and nobody notices.

Image

How did you fix it?
2 GPUs; my settings are:
actor_rollout_ref.actor.ppo_mini_batch_size=8 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \

chuangzhidan avatar Apr 10 '25 05:04 chuangzhidan

I used torch.clamp(kl, min=-5, max=5); it didn't work, sadly.

chuangzhidan avatar Apr 10 '25 05:04 chuangzhidan

Still not working.

Xia723 avatar Apr 10 '25 06:04 Xia723

For me, I always hit the same !!! problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved this by referring to #405. Hope this helps somebody. Besides, this points to another issue: the assertion that was added won't actually run, since we don't use "ppo_micro_batch_size" anymore, so the assert is skipped and nobody notices. Image

How did you fix it? 2 GPUs; my settings are: actor_rollout_ref.actor.ppo_mini_batch_size=8 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \

Check the output during training: do you also get grad_norm = nan after the first step? If so, I think it's better to check whether the loss is calculated correctly. If not, then I guess we are not in the same case.

great-luao avatar Apr 10 '25 10:04 great-luao

Thanks for the solution, but unfortunately in my case it seems to address a consequence rather than the root cause. Even when kl is constrained and no longer produces nan, the training still fails.

water-vapor avatar Apr 14 '25 16:04 water-vapor

I turned off use_kl_loss and it's still not working, so I switched to OpenRLHF.

chuangzhidan avatar Apr 15 '25 03:04 chuangzhidan

Thanks for sharing! I had the same issue and your torch.clamp fix worked for me too.

DEM1TASSE avatar Apr 15 '25 08:04 DEM1TASSE

For me, I always hit the same !!! problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved this by referring to #405. Hope this helps somebody. Besides, this points to another issue: the assertion that was added won't actually run, since we don't use "ppo_micro_batch_size" anymore, so the assert is skipped and nobody notices.

Image

How did you solve it exactly again? Thanks!

lynnliu030 avatar Apr 23 '25 18:04 lynnliu030

I also ran into this problem with vLLM 0.8.2 on 2 nodes. After someone pointed it out, I disabled vLLM's V1 engine and everything went back to normal. Concretely: set export VLLM_USE_V1=0, and do not set
actor_rollout_ref.rollout.enforce_eager=False
actor_rollout_ref.rollout.free_cache_engine=False
in the training arguments.

takagi97 avatar May 07 '25 11:05 takagi97

I also ran into this problem with vLLM 0.8.2 on 2 nodes. After someone pointed it out, I disabled vLLM's V1 engine and everything went back to normal. Concretely: set export VLLM_USE_V1=0, and do not set actor_rollout_ref.rollout.enforce_eager=False or actor_rollout_ref.rollout.free_cache_engine=False in the training arguments.

What problem does this actually fix? In my case, training runs until it plateaus and then grad_norm becomes nan.

takfate avatar May 07 '25 13:05 takfate

I also ran into this problem with vLLM 0.8.2 on 2 nodes. After someone pointed it out, I disabled vLLM's V1 engine and everything went back to normal. Concretely: set export VLLM_USE_V1=0, and do not set actor_rollout_ref.rollout.enforce_eager=False or actor_rollout_ref.rollout.free_cache_engine=False in the training arguments.

What problem does this actually fix? In my case, training runs until it plateaus and then grad_norm becomes nan.

I haven't looked into it carefully. I've seen other issues discussing this kind of sudden training blow-up: one cause is that the probabilities of the sampled tokens computed under the current policy and the old policy differ too much, so the computed ratio explodes numerically. I'm not sure whether that's the cause here, but on my side the V0 engine works and V1 doesn't, so I suspect a V1 bug may be triggering the problem above.
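A toy numerical illustration of that mechanism (the log-prob values are made up, just to show how the ratio overflows and why clamping the log-ratio helps):

```python
import torch

logp_old = torch.tensor([-2.0, -3.0, -1.5])   # log-probs under the rollout (old) policy
logp_new = torch.tensor([-2.1, -3.2, -90.0])  # current policy; one token drifted far away

log_ratio = logp_new - logp_old
print(torch.exp(log_ratio))          # third entry underflows toward zero
print(torch.exp(-log_ratio).half())  # the opposite direction overflows to inf in fp16

# Clamping the log-ratio before exponentiating keeps the result finite,
# which is what the torch.clamp(kl, ...) workaround earlier in this thread does.
print(torch.exp(torch.clamp(-log_ratio, max=20.0)))
```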

takagi97 avatar May 14 '25 12:05 takagi97

Marking this; seems to be a common issue.

141forever avatar May 30 '25 13:05 141forever

I also ran into this problem with vLLM 0.8.2 on 2 nodes. After someone pointed it out, I disabled vLLM's V1 engine and everything went back to normal. Concretely: set export VLLM_USE_V1=0, and do not set actor_rollout_ref.rollout.enforce_eager=False or actor_rollout_ref.rollout.free_cache_engine=False in the training arguments.

On my side, with 4 × 24 GB GPUs, just setting actor_rollout_ref.rollout.enforce_eager=False and actor_rollout_ref.rollout.free_cache_engine=False was enough to make it work.

HollrayChan avatar Aug 28 '25 03:08 HollrayChan