Why does SFT sum the cross-entropy loss within each sequence?
Thank you for maintaining such an important repository. I really enjoyed and learned a lot from reading your DPO paper.
I have one question regarding the SFT loss implementation in the repository. As far as I can tell, the SFT loss sums the cross entropy loss within each sequence. However, from my understanding, the language modeling loss conventionally averages the cross entropy over all tokens in the batch (Ref: GPT2 Loss). This means the standard cross entropy loss computed by TRL's SFTTrainer differs from this repository's SFT loss. Why is SFT implemented this way?
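For concreteness, here is a minimal sketch (not the repository's actual code) contrasting the two reductions. It assumes `logits` of shape `[batch, seq_len, vocab]` and `labels` of shape `[batch, seq_len]` with ignored positions marked by `-100`, following the usual Hugging Face convention:

```python
import torch
import torch.nn.functional as F

def sft_losses(logits: torch.Tensor, labels: torch.Tensor):
    vocab = logits.size(-1)
    # Per-token cross entropy, shape [batch, seq_len]
    token_loss = F.cross_entropy(
        logits.view(-1, vocab), labels.view(-1),
        ignore_index=-100, reduction="none",
    ).view(labels.shape)
    mask = (labels != -100).float()

    # What this repository appears to do: sum over the tokens of each
    # sequence, then average over the batch.
    per_seq_sum = (token_loss * mask).sum(-1)          # [batch]
    loss_sum_per_seq = per_seq_sum.mean()

    # Conventional LM loss (e.g. GPT-2 / TRL's SFTTrainer): average over
    # all non-ignored tokens in the batch.
    loss_mean_per_token = (token_loss * mask).sum() / mask.sum()

    return loss_sum_per_seq, loss_mean_per_token
```

The two only coincide when every sequence in the batch has the same number of non-padding tokens; otherwise the summed version weights longer sequences more heavily.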
Same question here. Hi @YJWon99, do you have any ideas now?
Has this been solved? Same question here.
@yiyepiaoling0715 I think it's a bug in their code: the loss should be averaged over the sequence. I made that revision in my experiments.
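A minimal sketch of that kind of revision, assuming `token_loss` is the per-token cross entropy of shape `[batch, seq_len]` and `labels` marks padding with `-100` (names are my own, not the repository's):

```python
import torch

def per_sequence_mean_loss(token_loss: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average the cross entropy over each sequence's real tokens, then over the batch."""
    mask = (labels != -100).float()
    # Masked mean over each sequence; clamp avoids division by zero for empty sequences.
    per_seq = (token_loss * mask).sum(-1) / mask.sum(-1).clamp(min=1)  # [batch]
    return per_seq.mean()
```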