SFT loss
https://github.com/microsoft/DeepSpeedExamples/blob/d570b2cc8a8fd4207c9424744669437d4c68ec43/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L122
if self.train_phase == 1:
    return {
        "input_ids": self.chosen_dataset[idx]["input_ids"],
        "attention_mask": self.chosen_dataset[idx]["attention_mask"],
        "labels": self.chosen_dataset[idx]["input_ids"]
    }
In the SFT stage, input_ids and labels are the same, so the loss calculation includes the prompt's loss. Shouldn't we only compute the loss on the chosen response instead?
This is the common setting for causal language modeling. Please take a look at the GPT paper or the HF CLM example.
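For concreteness, here is a minimal sketch of that common setting using the Hugging Face transformers API (the OPT-125m checkpoint and the example text are just placeholders, not part of DeepSpeed-Chat): passing labels equal to input_ids makes the model compute the standard next-token loss over every position, prompt included.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Any decoder-only causal LM (GPT-2, OPT, ...) behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

text = "Human: What is SFT?\nAssistant: Supervised fine-tuning."
batch = tokenizer(text, return_tensors="pt")

# Standard CLM setting: labels are simply a copy of input_ids.
# The model shifts them internally, so the loss covers all tokens,
# prompt and response alike.
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
print(outputs.loss)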
Closing as the issue seems solved. Feel free to reopen for any further questions.
Regarding the prompt loss in the original question: it was excluded in the pre-training stage of models such as GPT-2 or GPT-3. Whether to do the same in the SFT stage, I haven't found relevant information yet. Both removing and retaining it seem reasonable; maybe experiments will tell us the answer?
@yaozhewei @conglongli I think there is a more important issue with the loss calculation, and it is necessary to reopen this issue.
We can refer to the Hugging Face implementation here, where the targets are effectively the input shifted one position to the left, so each position predicts the next token (see the sketch below).
PS: I confirmed that the OPT model is also decoder-only, the same as GPT-2 and GPT-3.
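For reference, a minimal sketch of that left shift, modeled on the loss computation in Hugging Face's causal-LM heads (variable names here are illustrative):

import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    # Drop the last logit and the first label so that position i predicts
    # token i + 1, i.e. the targets are the inputs shifted one step left.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1),
                           ignore_index=-100)  # positions labeled -100 contribute nothing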
I also think this issue makes sense, because many well-known SFT projects such as Alpaca and Vicuna exclude the prompt's loss. Anyway, it's an interesting thing to explore whether we should include the prompt's loss or not.
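For comparison, a minimal sketch of the kind of prompt masking those projects use (the helper below is hypothetical, not DeepSpeed-Chat's or Alpaca's actual code): the prompt tokens' labels are set to -100, the default ignore_index of torch.nn.CrossEntropyLoss, so only the response tokens contribute to the loss.

import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_sft_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    # input_ids: 1-D tensor holding the prompt tokens followed by the response tokens
    # prompt_len: number of prompt tokens at the start of the sequence
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # loss is computed only on the response
    return labels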