Results: 4 comments by yiy
> TRL uses `accelerate` as its backend and as such supports multi-GPU training, but via data parallelism. That means the model still needs to be loaded on a single machine....
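Have you tried training for more steps?
Does the norm + clip configuration only delay the onset of this problem (is its effect the same as lowering the lr)? With more training steps, it still converges to max-length.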
`flash_attn/flash_attn_triton.py` supports a bias input, so you can use bias=-inf.
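For anyone wondering why a bias of -inf helps: below is a rough sketch of the general idea in plain Python, not the flash_attn code itself. It assumes the bias is added to the attention scores before the softmax, so a position with bias -inf ends up with zero weight. The function name `softmax_with_bias` and the sample numbers are made up for illustration, and the sketch assumes at least one position is left unmasked.

```python
import math

def softmax_with_bias(scores, bias):
    """Softmax over (scores + bias). Hypothetical helper for illustration only.

    A bias of -inf drives that position's weight to exactly zero.
    Assumes at least one position has a finite bias.
    """
    shifted = [s + b for s, b in zip(scores, bias)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in shifted]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values: the third position is masked out with bias=-inf.
scores = [2.0, 1.0, 0.5]
bias = [0.0, 0.0, float("-inf")]
weights = softmax_with_bias(scores, bias)
```

The masked position contributes nothing, and the remaining weights are renormalized over the unmasked positions only.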