
SFT loss

Open ruidongtd opened this issue 1 year ago

https://github.com/microsoft/DeepSpeedExamples/blob/d570b2cc8a8fd4207c9424744669437d4c68ec43/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L122

if self.train_phase == 1:
    return {
        "input_ids": self.chosen_dataset[idx]["input_ids"],
        "attention_mask": self.chosen_dataset[idx]["attention_mask"],
        "labels": self.chosen_dataset[idx]["input_ids"]
    }

In the SFT stage, input_ids and labels are the same, so the loss calculation includes the prompt's loss. Shouldn't we only calculate the loss on the chosen response instead?
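For illustration only (this is not the DeepSpeed-Chat code): one way to exclude the prompt from the loss would be to set the prompt positions of the labels to -100, the ignore_index used by torch.nn.CrossEntropyLoss and by the Hugging Face causal-LM heads. Here build_labels and prompt_len are hypothetical names, with prompt_len standing for a per-sample prompt length.

import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    # Copy the inputs, then mask out everything we do not want to train on.
    labels = input_ids.clone()
    labels[:prompt_len] = -100                  # ignore the prompt tokens
    labels[input_ids == pad_token_id] = -100    # ignore padding as well
    return labels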

ruidongtd avatar Apr 14 '23 08:04 ruidongtd

This is the common setting for causal language modeling. Please take a look at the GPT paper or the HF CLM example.

yaozhewei avatar Apr 18 '23 16:04 yaozhewei

Closing as the issue seems solved. Feel free to reopen for any further questions.

conglongli avatar Apr 21 '23 22:04 conglongli

Regarding the prompt loss in the original question: it was excluded in the pre-training stage of models such as GPT-2 or GPT-3. Whether to do the same in the SFT stage, I haven't found relevant information yet. Either removing or keeping it seems reasonable; maybe experiments will tell us the answer.

@yaozhewei @conglongli I think there is a more important issue with the loss calculation, and it is worth reopening this issue.

[Screenshot of the Hugging Face loss computation.] We can refer to the Hugging Face implementation here, where the targets are obtained by shifting the input one position to the left.
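As a rough sketch of that shift (assuming logits of shape [..., seq_len, vocab] and labels equal to the input ids, as in the Hugging Face causal-LM heads), the logits at position t are scored against the token at position t+1:

import torch
import torch.nn.functional as F

def clm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Drop the last logit and the first label so position t predicts token t+1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # masked positions (e.g. a masked prompt) are skipped
    )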

PS: I confirmed that the OPT model is also decoder-only, the same as GPT-2 or GPT-3.

xf4fresh avatar Apr 24 '23 10:04 xf4fresh

I also think this issue makes sense, because many well-known SFT works such as Alpaca and Vicuna exclude the prompt's loss. Anyway, it is an interesting question to explore whether we should include the prompt's loss or not.

slatter666 avatar Jul 26 '23 11:07 slatter666