Why does SFT sum the cross-entropy loss within each sequence?
Thank you for maintaining such an important repository. I really enjoyed and learned a lot from reading your DPO paper.
I have one question regarding the SFT loss implementation in the repository. As far as I can tell, the SFT loss sums the cross entropy loss within each sequence. However, from my understanding, the language modeling loss conventionally averages the cross entropy over all tokens in the batch (Ref: GPT2 Loss). This means the standard cross entropy loss computed by TRL's SFTTrainer differs from this repository's SFT loss. Why is SFT implemented this way?
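For concreteness, here is a minimal sketch (not the repository's actual code) contrasting the two reductions. It assumes `logits` of shape `[batch, seq_len, vocab]` and `labels` of shape `[batch, seq_len]` with ignored positions marked by `-100`, following the usual Hugging Face convention:

```python
import torch
import torch.nn.functional as F

def sft_losses(logits: torch.Tensor, labels: torch.Tensor):
    vocab = logits.size(-1)
    # Per-token cross entropy, shape [batch, seq_len]
    token_loss = F.cross_entropy(
        logits.view(-1, vocab), labels.view(-1),
        ignore_index=-100, reduction="none",
    ).view(labels.shape)
    mask = (labels != -100).float()

    # What this repository appears to do: sum over the tokens of each
    # sequence, then average over the batch.
    per_seq_sum = (token_loss * mask).sum(-1)          # [batch]
    loss_sum_per_seq = per_seq_sum.mean()

    # Conventional LM loss (e.g. GPT-2 / TRL's SFTTrainer): average over
    # all non-ignored tokens in the batch.
    loss_mean_per_token = (token_loss * mask).sum() / mask.sum()

    return loss_sum_per_seq, loss_mean_per_token
```

The two only coincide when every sequence in the batch has the same number of non-padding tokens; otherwise the summed version weights longer sequences more heavily.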
Same question here. Hi @YJWon99, do you have any ideas now?
Has this been solved? Same question here.
@yiyepiaoling0715 I think it's a bug in their code: the loss should be averaged over the sequence. I made that revision in my experiments.
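A minimal sketch of that kind of revision, assuming `token_loss` is the per-token cross entropy of shape `[batch, seq_len]` and `labels` marks padding with `-100` (names are my own, not the repository's):

```python
import torch

def per_sequence_mean_loss(token_loss: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average the cross entropy over each sequence's real tokens, then over the batch."""
    mask = (labels != -100).float()
    # Masked mean over each sequence; clamp avoids division by zero for empty sequences.
    per_seq = (token_loss * mask).sum(-1) / mask.sum(-1).clamp(min=1)  # [batch]
    return per_seq.mean()
```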