yiy comments


Results: 4 comments by yiy

For small models, the actor, critic, init-actor, and reward models can all be loaded on a single machine. However, how should the PPO process be built for LLMs?

> TRL uses `accelerate` as its backend and as such supports multi-GPU training, but via data parallelism. That means the model still needs to be loaded on a single machine....
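A rough way to see why the single-machine constraint matters for LLM-scale PPO is to count weight and optimizer memory for the four models. This is an assumption-laden sketch, not TRL's own accounting: it assumes the critic and reward models are the same size as the actor, that the actor and critic are trained with mixed-precision Adam (about 16 bytes per parameter: bf16 weights and gradients plus fp32 master weights and two Adam moments), and that the frozen init-actor and reward models are held in bf16 (2 bytes per parameter). Activations, KV cache, and framework overhead are ignored. The names `ppo_memory_gb` and `PPOMemory` are hypothetical.

```python
from dataclasses import dataclass

# Bytes per parameter under the stated (assumed) mixed-precision Adam setup.
TRAINABLE_BYTES_PER_PARAM = 2 + 2 + 12  # bf16 weights + bf16 grads + fp32 master/momentum/variance
FROZEN_BYTES_PER_PARAM = 2              # bf16 weights only: no gradients, no optimizer state


@dataclass
class PPOMemory:
    actor_gb: float
    critic_gb: float
    init_actor_gb: float
    reward_gb: float

    @property
    def total_gb(self) -> float:
        return self.actor_gb + self.critic_gb + self.init_actor_gb + self.reward_gb


def ppo_memory_gb(num_params: float) -> PPOMemory:
    """Hypothetical estimate (decimal GB) of weight + optimizer memory for the four PPO
    models, assuming all four share the same parameter count."""
    trainable = num_params * TRAINABLE_BYTES_PER_PARAM / 1e9  # actor and critic are updated
    frozen = num_params * FROZEN_BYTES_PER_PARAM / 1e9        # init-actor and reward are frozen
    return PPOMemory(actor_gb=trainable, critic_gb=trainable,
                     init_actor_gb=frozen, reward_gb=frozen)


if __name__ == "__main__":
    est = ppo_memory_gb(7e9)
    print(f"7B actor/critic/init-actor/reward: ~{est.total_gb:.0f} GB before activations")
```

Under pure data parallelism each replica holds a full copy of all of this, which is why the quoted setup needs everything to fit on one machine; sharding schemes such as ZeRO or tensor parallelism split it across devices instead.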

Effectiveness of PPO-max compared with the original PPO

Have you tried training for more steps?

Effectiveness of PPO-max compared with the original PPO

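Does the norm+clip configuration only delay the onset of this problem (is its effect the same as lowering the lr)? When training for more steps, it still converges to max-length.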
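To make the question concrete, here is a minimal sketch of one common form of reward normalization plus clipping. It assumes that "norm+clip" means standardizing each scalar reward with a running mean and standard deviation (Welford's online algorithm) and then clipping the result to `[-clip, clip]`; the class name `RunningRewardNormalizer` and the default `clip=5.0` are hypothetical and not taken from PPO-max.

```python
import math


class RunningRewardNormalizer:
    """Hypothetical helper: standardize rewards with running statistics, then clip."""

    def __init__(self, clip: float = 5.0, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps

    def update(self, x: float) -> None:
        # Welford's online update of the mean and M2.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Population std; fall back to 1.0 until there are at least two samples.
        return math.sqrt(self.m2 / self.count) if self.count > 1 else 1.0

    def __call__(self, reward: float) -> float:
        self.update(reward)
        z = (reward - self.mean) / (self.std + self.eps)
        return max(-self.clip, min(self.clip, z))  # clip the standardized reward


if __name__ == "__main__":
    norm = RunningRewardNormalizer(clip=1.0)
    print([round(norm(r), 3) for r in [1.0, 2.0, 3.0, 4.0, 100.0]])
```

Clipping saturates outlier rewards instead of shrinking every update uniformly, which is one way it differs from lowering the learning rate; whether that prevents the drift toward max-length or only postpones it is exactly what the question above asks.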

[v2] Attention Masking

flash_attn/flash_attn_triton.py supports a bias input; you can use bias=-inf at the positions to be masked.
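The idea behind `bias=-inf` is that a mask can be expressed as an additive term on the attention scores: adding 0 keeps a key, while adding -inf drives its softmax weight to exactly zero. The plain-Python sketch below checks that equivalence for a single query. It is a reference illustration, not the flash_attn Triton kernel, and the helpers `mask_to_bias` and `attention_row` are hypothetical; in a real call the bias is a tensor that broadcasts against the attention scores (see the docstring in flash_attn_triton.py for the exact layout).

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def mask_to_bias(keep: Sequence[bool]) -> List[float]:
    """Hypothetical helper: True = attend (bias 0.0), False = mask out (bias -inf)."""
    return [0.0 if k else NEG_INF for k in keep]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for stability; if every entry is -inf the result is NaN
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) == 0.0, so masked keys get zero weight
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(
    q: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Single-query scaled dot-product attention: softmax(q.k / sqrt(d) + bias) @ V."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) * scale + b for k, b in zip(keys, bias)]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    values = [[1.0], [2.0], [100.0]]
    # Masking the third key with -inf should match attending over the first two keys only.
    masked_w, masked_out = attention_row(q, keys, values, mask_to_bias([True, True, False]))
    plain_w, plain_out = attention_row(q, keys[:2], values[:2], [0.0, 0.0])
    print(masked_w, masked_out)
    print(plain_w, plain_out)
```

One caveat: a query whose keys are all masked gets NaN from a pure -inf bias, so fully padded rows may need separate handling, for example a large finite negative value instead of -inf.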
