Yanggan Gu
I noticed that `scores` in `reward_fn` is equal to `logits_i - logsumexp(logits)`, which is exactly what `log_softmax` computes. Why not use `log_softmax` directly? https://github.com/microsoft/LMOps/blob/5fbf5bcd6e6760fa95aaaf945fb5d9cb033135f6/minillm/minillm/reward.py#L33
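To illustrate the identity behind this question, here is a minimal standard-library sketch, not the code from the linked `reward.py`. The helper names `logsumexp` and `log_of_softmax` and the sample `logits` values are made up for illustration. It checks numerically that `logits_i - logsumexp(logits)` matches the log of the softmax probabilities, which is the quantity `torch.nn.functional.log_softmax` returns.

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))): shift by the max before exponentiating.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_of_softmax(xs):
    # Reference path: compute softmax probabilities first, then take the log.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

# Hypothetical logits for a single position over a 4-token vocabulary.
logits = [2.0, -1.0, 0.5, 3.0]

# The expression from the question: logits_i - logsumexp(logits).
lse = logsumexp(logits)
scores = [x - lse for x in logits]

# The same values obtained as log(softmax(logits)).
log_probs = log_of_softmax(logits)

max_abs_diff = max(abs(a - b) for a, b in zip(scores, log_probs))
print(max_abs_diff)
```

The two paths agree up to floating-point rounding. Whether the original code has some other reason for spelling the computation out by hand (masking or parallel-vocabulary handling, for example) is not something this sketch can answer.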
While looking at the examples, I found that the DPO example script applies `apply_chat_template` to `chosen` and `rejected` but not to `prompt`. https://github.com/huggingface/trl/blob/d1ed730ab8281b1b0c78d7d61bc0f6603a9ce958/examples/scripts/dpo.py#L150-L152 And it seems that `chosen`...