yiy

Results: 4 comments by yiy

> TRL uses `accelerate` as its backend and as such supports multi-GPU training, but via data parallelism. That means the model still needs to be loaded on a single machine....

Have you tried training for more steps?

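Does the norm+clip configuration only delay the onset of this problem (is its effect the same as lowering the lr)? When training for more steps, it still converges to max-length.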

flash_attn/flash_attn_triton.py supports a bias input, so you can mask positions by using bias=-inf.
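
To show why a bias of -inf acts as a mask, here is a minimal NumPy sketch of plain softmax attention with an additive bias on the scores. It does not call flash_attn; the function name `attention_with_bias` and the toy shapes are assumptions made for illustration only.

```python
import numpy as np

def attention_with_bias(q, k, v, bias):
    """Reference softmax attention with an additive bias on the scores.

    Hypothetical helper for illustration; not the flash_attn API.
    q: (n_q, d), k: (n_k, d), v: (n_k, d_v), bias: (n_q, n_k)
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale + bias
    # Subtract the row max for numerical stability. This assumes every
    # row keeps at least one finite entry; a fully masked row gives NaN.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # exp(-inf) == 0, so masked keys get zero weight
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4))
k = rng.standard_normal((3, 4))
v = rng.standard_normal((3, 4))

# Mask out the last key for every query by setting its bias to -inf.
bias = np.zeros((2, 3))
bias[:, 2] = -np.inf

out, weights = attention_with_bias(q, k, v, bias)
```

The masked key receives exactly zero attention weight, so the output matches attention computed over the first two keys alone.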