wnzhyee

Results 5 comments of wnzhyee

I have the same problem. Have you solved it?

https://github.com/uclaml/SPIN/blob/e84b7be111b41b388367e591bdc23e327725c869/spin/alignment/trainer.py#L405 In the spin_loss definition, the loss at step 0 starts at a fixed value of 0.6931 (ln 2), because p_theta equals p_theta_t at that point.
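To see where 0.6931 comes from, here is a minimal sketch, assuming the loss has the logistic form -log(sigmoid(beta * margin)), where the margin is the policy-vs-reference log-probability gap on the real response minus the same gap on the generated response. The function name, argument names, and the log-probability values are hypothetical and are not taken from the repository. When the policy equals the reference, the margin is 0 and the loss is ln 2 regardless of beta.

```python
import math

def logistic_spin_loss(policy_real_logp, policy_gen_logp,
                       ref_real_logp, ref_gen_logp, beta=0.1):
    """Hypothetical scalar version of a logistic SPIN-style loss (assumed form)."""
    # Gap between policy and reference on the real response,
    # minus the same gap on the self-generated response.
    margin = (policy_real_logp - ref_real_logp) - (policy_gen_logp - ref_gen_logp)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Step 0: the policy is still identical to the reference, so both gaps are 0.
# The log-probability values below are made up for illustration.
loss_at_step0 = logistic_spin_loss(-12.3, -15.7, -12.3, -15.7)
print(round(loss_at_step0, 4))  # 0.6931
```

So a starting value of 0.6931 is expected under this assumption and does not by itself indicate a bug; the loss should move away from ln 2 once the policy starts to differ from the reference.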

> DPO relies on the Bradley-Terry (BT) model or the more general Plackett-Luce models, matching outcomes of pairwise comparisons directly with an implicit reward model. Therefore, the core DPO methodology...