Deng Dong

Results: 1 comment by Deng Dong

I've also encountered this problem when training with DPO or PPO. I solved it by decreasing the learning rate (both actor lr and critic lr) from 1e-5 to 1e-6. I think...
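
For anyone wanting to try the same change, here is a minimal sketch of what it might look like. The `TrainConfig` class and its `actor_lr` / `critic_lr` field names are hypothetical and not tied to any particular training framework, so map them to whatever options your own trainer exposes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical config holder; field names are assumptions."""
    actor_lr: float = 1e-5   # original actor learning rate from the comment
    critic_lr: float = 1e-5  # original critic learning rate from the comment


def with_lower_lr(cfg: TrainConfig, new_lr: float = 1e-6) -> TrainConfig:
    """Return a copy of cfg with both learning rates set to new_lr."""
    return replace(cfg, actor_lr=new_lr, critic_lr=new_lr)


baseline = TrainConfig()
stable = with_lower_lr(baseline)
print(f"actor lr: {baseline.actor_lr} -> {stable.actor_lr}")
print(f"critic lr: {baseline.critic_lr} -> {stable.critic_lr}")
```

Both rates are lowered together here because that is what the comment describes. Whether 1e-6 is the right value for a different model or dataset is something you would need to check on your own runs.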