Deng Dong
Results
1
comments of
Deng Dong
I've also encoutered this problem when i trained using dpo or ppo, I solve it by decrease the learning rate (actor lr and critic lr) from 1e-5 to 1e-6,I think...