MYY
I imitated this code and wrote a PyTorch-based version, using ResNet-101 to extract features, and got a score close to the one in the paper. ratio: 1.01423555697 Bleu_1: 0.708868328298...
@Yikun Thanks for your efforts! Can PR https://github.com/volcengine/verl/pull/332 be run directly on 910b2c 64GB?
> Yes, the initial support will be based on community CI, which runs on the Atlas A2 series; for more info please see the CI info: https://github.com/volcengine/verl/actions/runs/13649493801/job/38154761144 Thank you for your quick...
Is my base environment (including the driver, NCCL, etc.) corrupted? I am training LLMs, which is unrelated to vLLM, yet I am seeing new warnings. @DarkLight1337 ``` [WARNING] async_io requires...
> @DarkLight1337 I'm not sure about the specific error, but it looks like something is wrong with the `nvidia-nccl-cu12` dependency. > > @takagi97 could you reinstall `vllm` in a fresh...
I have tested a similar conda env on a new server. My conclusion is that the following error comes from the conda env itself: ``` [WARNING] async_io requires the dev libaio...
> I have tested a similar conda env on a new server. My conclusion is that the following error comes from the conda env itself: > > ``` > [WARNING] async_io...
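Before rebuilding the env as suggested above, it can help to confirm which package versions are actually installed. The check below is a sketch, not anything from verl or vLLM itself; it just uses Python's standard `importlib.metadata` to report the `vllm` and `nvidia-nccl-cu12` versions mentioned earlier in the thread:

```shell
# Diagnostic sketch: print installed versions of the packages suspected
# above ("not installed" if the package is missing from this env).
python - <<'EOF'
from importlib.metadata import version, PackageNotFoundError

for pkg in ("vllm", "nvidia-nccl-cu12"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
EOF
```

If the versions differ between the working and broken servers, that points at the env rather than the driver or NCCL install.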
I also ran into this problem with vllm 0.8.2 on 2 nodes. Someone pointed out that disabling vLLM's V1 engine fixes it, and after that everything worked. Concretely: set export VLLM_USE_V1=0, and do not pass the training parameters actor_rollout_ref.rollout.enforce_eager=False \ actor_rollout_ref.rollout.free_cache_engine=False \ and it works.
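A minimal sketch of that workaround; the env var and the two dropped overrides come from the comment above, while the actual launch command is whatever you already use:

```shell
# Workaround sketch: force vLLM back to the V0 engine for rollouts.
export VLLM_USE_V1=0

# With V1 disabled, also drop these two overrides from the training args
# (i.e. do NOT pass them on the command line):
#   actor_rollout_ref.rollout.enforce_eager=False
#   actor_rollout_ref.rollout.free_cache_engine=False

# Confirm the engine selection before launching training:
echo "VLLM_USE_V1=$VLLM_USE_V1"
```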
> > I also ran into this problem with vllm 0.8.2 on 2 nodes. Someone pointed out that disabling vLLM's V1 engine fixes it, and after that everything worked. Concretely: set export VLLM_USE_V1=0, and do not pass the training parameters actor_rollout_ref.rollout.enforce_eager=False actor_rollout_ref.rollout.free_cache_engine=False and it works. > > May I ask what problem this fixes? In my case the run reaches a plateau and then the grad norm becomes NaN. I haven't looked into it carefully. I've seen other issues discussing this kind of sudden training blow-up; one possible cause is that the probability gap between the current policy and the old policy on the sampled outputs is too large, so the computed ratio explodes numerically. I'm not sure whether that's the cause here, but on my side v0 runs while v1 doesn't, so I suspect a v1 bug may be triggering the problem above.
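The ratio-explosion mechanism described above can be illustrated with a few lines of standalone Python. This is not verl's actual loss code, just the standard PPO importance ratio r = exp(logp_new − logp_old) and the usual clipping that bounds its effect:

```python
import math

def ppo_ratio(logp_new: float, logp_old: float) -> float:
    """Importance ratio between the current and old policy for one sample."""
    return math.exp(logp_new - logp_old)

def clipped_ratio(r: float, eps: float = 0.2) -> float:
    """PPO-style clip: confine the ratio to [1 - eps, 1 + eps]."""
    return max(min(r, 1.0 + eps), 1.0 - eps)

# A small log-prob gap keeps the ratio near 1:
print(ppo_ratio(-2.0, -2.1))   # ~1.105

# A large gap (e.g. mismatched rollout vs. training engines) explodes it:
print(ppo_ratio(-2.0, -20.0))  # ~6.6e7

# Clipping caps the contribution to the loss, but a persistently huge raw
# ratio still signals that the two policies' log-probs have diverged:
print(clipped_ratio(6.6e7))    # 1.2
```

If the V1 engine returned log-probs inconsistent with the trainer's own forward pass, this is exactly the kind of blow-up (and eventual NaN grad norm) you would expect.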