SFT loss
https://github.com/microsoft/DeepSpeedExamples/blob/d570b2cc8a8fd4207c9424744669437d4c68ec43/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L122
if self.train_phase == 1:
    return {
        "input_ids": self.chosen_dataset[idx]["input_ids"],
        "attention_mask": self.chosen_dataset[idx]["attention_mask"],
        "labels": self.chosen_dataset[idx]["input_ids"]
    }
In the SFT stage, input_ids and labels are the same, so the loss calculation includes the prompt's loss. Shouldn't we only compute the loss on the chosen response instead?
This is the common setting for causal language modeling. Please take a look at the GPT paper or the HF CLM example.
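For concreteness, here is a minimal sketch of that common setting using the Hugging Face transformers API (the OPT-125m checkpoint and the example text are just placeholders, not part of DeepSpeed-Chat): passing labels equal to input_ids makes the model compute the standard next-token loss over every position, prompt included.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Any decoder-only causal LM (GPT-2, OPT, ...) behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

text = "Human: What is SFT?\nAssistant: Supervised fine-tuning."
batch = tokenizer(text, return_tensors="pt")

# Standard CLM setting: labels are simply a copy of input_ids.
# The model shifts them internally, so the loss covers all tokens,
# prompt and response alike.
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
print(outputs.loss)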
Closing as the issue seems solved. Feel free to reopen for any further questions.
Regarding the prompt loss in the original question: it was excluded in the pre-training stage of models such as GPT-2 or GPT-3. Whether to do the same in the SFT stage, I haven't found relevant information yet. Both removing and retaining it seem reasonable; maybe experiments will tell us the answer?
@yaozhewei @conglongli I think there is a more important issue with the loss calculation, and it is necessary to reopen this issue.
We can refer to the Hugging Face implementation here, where the targets are effectively the input shifted one position to the left, so each position predicts the next token (see the sketch below).
PS: I confirmed that the OPT model is also decoder-only, the same as GPT-2 and GPT-3.
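For reference, a minimal sketch of that left shift, modeled on the loss computation in Hugging Face's causal-LM heads (variable names here are illustrative):

import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    # Drop the last logit and the first label so that position i predicts
    # token i + 1, i.e. the targets are the inputs shifted one step left.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1),
                           ignore_index=-100)  # positions labeled -100 contribute nothing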
I also think this issue makes sense, because many well-known SFT projects such as Alpaca and Vicuna exclude the prompt's loss. Anyway, it's an interesting thing to explore whether we should include the prompt's loss or not.
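For comparison, a minimal sketch of the kind of prompt masking those projects use (the helper below is hypothetical, not DeepSpeed-Chat's or Alpaca's actual code): the prompt tokens' labels are set to -100, the default ignore_index of torch.nn.CrossEntropyLoss, so only the response tokens contribute to the loss.

import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_sft_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    # input_ids: 1-D tensor holding the prompt tokens followed by the response tokens
    # prompt_len: number of prompt tokens at the start of the sequence
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # loss is computed only on the response
    return labels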