[Potential Bug] KL computation in `low_var_kl` (causes KL NaN and "!!!!!!!!!!!!!!!" output)
Hi,
A few days ago I reported this issue in discussion #751, and now there are some updates.
Similar issues:
- https://github.com/volcengine/verl/issues/747
- https://github.com/volcengine/verl/issues/721
- .......
After careful analysis, I found that the issue lies in the calculation of `low_var_kl`. In the problematic step the loss has normal values (no NaN or Inf), but during backpropagation the `exp` operation in `ratio = torch.exp(kl)` inside `low_var_kl` (https://github.com/volcengine/verl/blob/8cae42dc29736d0802ded43c5ecf67a809d56bd8/verl/trainer/ppo/core_algos.py#L386) produces extreme values, leading to NaN in `kl`. My temporary workaround is to constrain `kl` with `torch.clamp(kl, min=-5, max=5)` before computing `ratio`. This works, but I am not sure it is the right fix. Could you kindly confirm whether this is an appropriate solution?
Originally posted by @yushuiwx in #751
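For reference, a minimal sketch of the clamp workaround described above, assuming the estimator is the standard low-variance (k3) form `exp(x) - x - 1` with `x = ref_logprob - logprob`. The function name is illustrative and the `(-5, 5)` range is just the ad-hoc value from this report; this is not the actual verl implementation.

```python
import torch

def low_var_kl_with_clamp(logprob: torch.Tensor, ref_logprob: torch.Tensor) -> torch.Tensor:
    """Low-variance (k3) KL estimate with the ad-hoc clamp described above.

    Illustrative sketch only; the real code lives in verl/trainer/ppo/core_algos.py.
    """
    kl = ref_logprob - logprob
    # Clamp *before* the exp so exp(kl) cannot blow up and poison the backward pass.
    kl = torch.clamp(kl, min=-5, max=5)
    ratio = torch.exp(kl)
    # k3 estimator: exp(x) - x - 1 >= 0 for all x.
    return ratio - kl - 1
```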
So the final clamp does not take effect?
For me, I always get the same "!!!" problem after the first training step. I found that my training grad_norm was 0, which caused the training to crash; I solved it by following https://github.com/volcengine/verl/issues/405. Hope this helps somebody, lol. Besides, this points to another issue: the assertion that was added no longer takes effect, since we don't use "ppo_micro_batch_size" anymore. The assert is simply skipped, and nobody notices.
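To illustrate what I mean by the assert being skipped, here is a purely hypothetical sketch (the config keys and access pattern are my own assumptions, not the actual verl check):

```python
# Hypothetical illustration: a guard keyed on a config entry that is no longer
# set is silently skipped, so the check never fires and nobody notices.
config = {"ppo_mini_batch_size": 8, "ppo_micro_batch_size_per_gpu": 2}

ppo_micro_batch_size = config.get("ppo_micro_batch_size")  # deprecated key -> None
if ppo_micro_batch_size is not None:
    # Never reached when only the *_per_gpu key is configured.
    assert config["ppo_mini_batch_size"] % ppo_micro_batch_size == 0
```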
How did you fix it? I'm running on 2 GPUs; my settings are:
actor_rollout_ref.actor.ppo_mini_batch_size=8 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
I used torch.clamp(kl, min=-5, max=5), but it didn't work, sadly.
Still not working.
Check the output during training: are you also getting grad_norm = nan after the first step? If so, I think it's better to check whether the loss is being calculated correctly. If not, then I guess we are not in the same case.
Thanks for the solution, but unfortunately in my case it seems to address a consequence rather than the root cause. Even when kl is constrained and no longer produces NaN, the training still fails.
I turned off use_kl_loss and it's still not working, so I switched to OpenRLHF.
Thanks for sharing! I had the same issue and your torch.clamp fix worked for me too.
How did you solve it exactly again? Thanks!
I also hit this problem running vLLM 0.8.2 on 2 nodes. Someone suggested disabling vLLM's V1 engine, and after that everything worked.
Concretely: set export VLLM_USE_V1=0, and do not set the following in the training arguments:
actor_rollout_ref.rollout.enforce_eager=False
actor_rollout_ref.rollout.free_cache_engine=False
That's it.
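If you launch training from a Python script rather than the shell, the same switch can be applied before vLLM is imported (a trivial sketch; VLLM_USE_V1 is vLLM's environment variable for toggling the V1 engine):

```python
import os

# Disable the vLLM V1 engine before anything imports vllm,
# equivalent to `export VLLM_USE_V1=0` in the shell.
os.environ["VLLM_USE_V1"] = "0"
```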
What problem does this actually fix? In my case training keeps plateauing and grad_norm becomes NaN.
I haven't looked into it carefully. I've seen other issues discussing this kind of sudden training blow-up: one cause is that the probabilities the current policy and the old policy assign to the sampled tokens differ too much, so the computed ratio explodes numerically. I'm not sure whether that is the cause here, but on my side V0 runs while V1 doesn't, so I suspect a V1 bug may be triggering the problem above.
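As a rough illustration of the ratio explosion mentioned above (a toy example with made-up log-probs, not taken from an actual run):

```python
import torch

# Toy example: when the current policy and the old/rollout policy assign very
# different log-probs to the same sampled token, the importance ratio
# exp(logp_new - logp_old) overflows float32 and turns into inf.
logp_old = torch.tensor([-105.0])  # token was near-impossible under the old policy
logp_new = torch.tensor([-1.0])    # but very likely under the current policy
ratio = torch.exp(logp_new - logp_old)
print(ratio)  # tensor([inf]) -- exp(104) exceeds the float32 range
```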
Marked, seems like a common issue.
On my side, with 4x 24 GB GPUs, just setting actor_rollout_ref.rollout.enforce_eager=False and actor_rollout_ref.rollout.free_cache_engine=False was enough to fix it.