stanford_alpaca

What exactly is the "supervised" task?

Open pGit1 opened this issue 2 years ago • 2 comments

Are we still doing next token prediction just on this specific dataset or more general seq2seq training where we map arbitrary length input sequences to arbitrary length output sequences?

For instance given this data:

{
    "instruction": "Identify the odd one out.",
    "input": "Twitter, Instagram, Telegram",
    "output": "Telegram"
}

What exactly is the objective? Is it to take a sequence "Identify the odd one out. <some sep>Twitter, Instagram, Telegram <some sep> Telegram" and try to predict "<ignore>Identify the odd one out.<some sep>Twitter, Instagram, Telegram<some sep>Telegram<eos>" (i.e. inputs shifted to the right by one, as in traditional teacher forcing), or is it something else?

pGit1, Mar 25 '23 09:03

I have the same question.

zhangfaen, Mar 25 '23 10:03

@zhangfaen I think ALL of this "supervised" finetuning confusion stems from the community's annoying use of terms, as popularized by the "SFT" portion of this paper: https://openreview.net/pdf?id=TG8KACxEON. See section 3.4.

I kept thinking something weird/magical was going on with the Trainer class that I did not understand, but actually there isn't. At best it's simply doing labels = "shift inputs to the right by one" for next token prediction, and at worst it's doing nothing as far as I can tell, depending on what you do in the data collator function for the Trainer class.

What the term "supervised fine tuning" means in this context is ONLY that the model learns from human- or machine-generated input and output pairs (hence the "supervision"). Since each input is associated with a specific output made by a human or machine, they term this "supervised" learning, DESPITE the fact that the actual training objective is STILL NEXT TOKEN PREDICTION.

This is where my confusion lay, and perhaps yours as well. Hopefully that clears things up. You can see here how they align input_ids and labels: https://github.com/tatsu-lab/stanford_alpaca/blob/eb5b171d9b103a12a8e14e0edca9cbc45fe1d512/train.py#L110. All the magic of "shifting inputs to the right" must get handled by their subsequent padding functions, collate functions, and the Trainer class. Unfortunately there are no comments in the train code, but if this isn't what is going on I am simply lost...

I could be wrong so take my answer with a grain of salt. Maybe there is something else going on. :man_shrugging:

Could be they aren't doing next token prediction at all and are just learning tokens one-to-one??

pGit1, Mar 25 '23 14:03

@zhangfaen My above answer is mostly correct. I answered my own question. All these people are doing is next word prediction in a standard "teacher forcing" setup. It's just all obfuscated by the lack of comments and by the transformers library code. See here: https://github.com/tloen/alpaca-lora/issues/171. The bottom line is, they set labels and input_ids equal, as I mentioned above, and then the model does the label shifting for next word prediction. So this "supervised training" boils down to next word prediction on this particular instruction dataset. The only difference is that this repo may be computing loss only on the "outputs", not on the inputs and instructions.
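A minimal sketch of that label shifting, in plain Python with no model involved. The token ids and the `shifted_pairs` helper are hypothetical and only illustrate which context is scored against which target when labels start out equal to input_ids:

```python
# Hypothetical token ids standing in for a tokenized "prompt + response + <eos>".
input_ids = [11, 12, 13, 14, 15]

# Setup described above: labels start out as a plain copy of input_ids.
labels = list(input_ids)


def shifted_pairs(input_ids, labels):
    """Pair each prefix of the sequence with the token it should predict.

    The output at position t is scored against labels[t + 1], so the
    first label is never used as a target and the last position has
    nothing left to predict.
    """
    pairs = []
    for t in range(len(input_ids) - 1):
        context = input_ids[: t + 1]   # tokens the model has seen so far
        target = labels[t + 1]         # next token it is asked to predict
        pairs.append((context, target))
    return pairs


pairs = shifted_pairs(input_ids, labels)
for context, target in pairs:
    print(context, "->", target)
```

Because the shift happens inside the loss computation, the training script itself never needs to build a shifted copy of the labels.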

pGit1, Mar 25 '23 19:03

@pGit1 see https://github.com/tatsu-lab/stanford_alpaca/blob/eb5b171d9b103a12a8e14e0edca9cbc45fe1d512/train.py#L131

labels = copy.deepcopy(input_ids)
for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
    label[:source_len] = IGNORE_INDEX

labels is not equal to input_ids, because label[0:source_len] is set to IGNORE_INDEX (the source is "instruction + input"; it does not include the output).
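The same masking can be sketched on plain Python lists instead of tensors. The token ids, the source lengths, and the `mask_source` helper are hypothetical, and the value -100 for IGNORE_INDEX is an assumption here (it matches PyTorch's default `ignore_index` for cross-entropy):

```python
import copy

IGNORE_INDEX = -100  # assumed value; PyTorch's default ignore_index


def mask_source(input_ids, source_lens, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels and blank out the source (prompt) part.

    source_lens[i] is the number of leading tokens in example i that
    belong to "instruction + input"; only the remaining tokens (the
    output) keep their real ids and therefore contribute to the loss.
    """
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, source_lens):
        n = min(source_len, len(label))  # guard against over-long lengths
        label[:n] = [ignore_index] * n
    return labels


# Two hypothetical examples: 4 prompt tokens + 2 output tokens,
# and 2 prompt tokens + 2 output tokens.
input_ids = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]
source_lens = [4, 2]

labels = mask_source(input_ids, source_lens)
print(labels)     # [[-100, -100, -100, -100, 5, 6], [-100, -100, 9, 10]]
print(input_ids)  # unchanged, because labels is a deep copy
```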

zhangfaen, Mar 26 '23 12:03

@zhangfaen, I think all that is happening here is that they are "masking" the instruction+inputs so that loss is only computed on the outputs. Since it's next word prediction they could technically use the whole sequence to compute the loss, but instead they only focus on the output portion of the labels, hence the ignore index. Good catch. Does that make sense?
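To make the effect of the ignore index concrete, here is a rough pure-Python sketch of a next-token cross-entropy that skips ignored targets. The `masked_next_token_loss` helper, the toy logits, and the 3-token vocabulary are all made up for illustration; a real training run would rely on the framework's loss, not this function:

```python
import math

IGNORE_INDEX = -100  # assumed value, matching PyTorch's default ignore_index


def masked_next_token_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over next-token targets, skipping ignored ones.

    logits: one list of vocabulary scores per position.
    labels: one token id per position (or ignore_index).
    Position t is scored against labels[t + 1], as in teacher forcing.
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == ignore_index:
            continue  # prompt token: contributes nothing to the loss
        scores = logits[t]
        m = max(scores)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


# Toy scores over a 3-token vocabulary for a 4-token sequence.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]

full_labels = [0, 1, 2, 1]                           # loss on every next token
masked_labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 1]   # loss on the output only

full_loss = masked_next_token_loss(logits, full_labels)
masked_loss = masked_next_token_loss(logits, masked_labels)
print(full_loss, masked_loss)
```

In this toy case the model is unsure about the prompt token and confident about the output tokens, so dropping the prompt position from the average lowers the reported loss.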


pGit1, Mar 27 '23 16:03


Yes, I agree. The difference between pretraining and supervised training is only in whether the source sequence is involved in the loss calculation.

nieallen, Apr 17 '23 09:04

@zhangfaen yup, that's it lol. :) So many annoying terms for simple stuff, but it all makes sense in the end.

pGit1, Apr 21 '23 01:04

@pGit1 and @zhangfaen If possible, I would like to ask you two questions regarding that "supervised" part.

I read your discussion but two questions still remain for me.

  1. What is the difference between an instruction-following fine-tuning process and a NON-instruction fine-tuning process?

  2. Is the fine-tuning process exactly the same for instruction fine-tuning and non-instruction fine-tuning?

The straightforward answer seems obvious: instruction fine-tuning organizes the dataset as instructions, whereas NON-instruction fine-tuning doesn't. But if possible I would like to confirm that.

Take this tutorial from Hugging Face (https://huggingface.co/docs/transformers/training) for example. They show how to fine-tune a BERT model using the Yelp review dataset (https://huggingface.co/datasets/yelp_review_full). That dataset is labeled, so I imagine that fine-tuning process is supervised, just like the instruction-following process for Alpaca, where we have a dataset formatted with the template shown at the bottom.

Basically the difference between the two datasets (Yelp vs. Alpaca) is that one follows a template for instruction tuning and the other doesn't, while both are supervised because they use a labeled dataset to fine-tune their models.

Finally, it seems to me then that the main difference between the two processes is how the data is formatted and not how the model is trained, since both are supervised. Is that correct?

Example from alpaca_data.json:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction: Write a sentence that indicates the given time frame.
### Input: Two days
### Response: The time frame being referred to is two days.
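As a rough sketch of how a record might be turned into that template, the snippet below splits it into a "source" (the part masked out of the loss) and a "target" (the response). The `PROMPT_WITH_INPUT` constant, the `build_example` helper, and the exact newline layout are assumptions for illustration, not taken from the repo:

```python
# Template text from the example above; the newline placement is an assumption.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)


def build_example(record):
    """Split one record into (source, target).

    source: the filled-in prompt, i.e. instruction + input.
    target: the desired response, the only part the loss is computed on.
    """
    source = PROMPT_WITH_INPUT.format(
        instruction=record["instruction"], input=record["input"]
    )
    target = record["output"]
    return source, target


record = {
    "instruction": "Write a sentence that indicates the given time frame.",
    "input": "Two days",
    "output": "The time frame being referred to is two days.",
}

source, target = build_example(record)
print(source)
print(target)
```

Under this split, the data formatting is the only instruction-specific step; training on the concatenated source and target is still ordinary next token prediction.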

igor17400, Jul 11 '23 07:07

Your intuition is correct and I will confirm that it is. Finetuning in ANY context is dependent on the relevant task users have in mind.

For instruction finetuning, as you correctly point out, the user wants a pretrained language model to "follow instructions" and complete tasks based on inputs from users. In non-instruction finetuning, users might be interested in some other task like "sentiment analysis", classifying reviews as negative or positive, or detecting "hate speech". In such a case you would finetune the model to classify reviews, but you could finetune it for any other task, like proving theorems, doing better arithmetic, summarizing long articles, etc. Hope that makes sense.

And indeed both are supervised, because in both cases you have X,Y pairs feeding the models: inputs and "desired outputs" (aka targets), which is classical supervised learning in general.


pGit1, Jul 23 '23 01:07

Thank you for the response @pGit1 !

igor17400, Aug 01 '23 07:08