
add alpaca gpt4 dataset

Open CloseChoice opened this issue 1 year ago • 3 comments

The input column contains many different variants of "no input", so we don't use the input column in those cases. In some cases the text in input is already contained in the instruction; in these cases we also don't use the input column.

I am not quite sure how best to concatenate the instruction and the input column. In most cases it seems fine to just replace the last occurrence of ., ! or ? in the instruction with a colon, e.g.:

Instruction: Identify the odd one out. Input: Twitter, Instagram, Telegram

or

Instruction: How dense is a given material? Input: Steel

But we also have some questions like: Instruction: Given the following synopsis, what is the moral lesson of this story? Input: Once upon a time, there was a poor young boy who wanted some candy. He begged his father for money to buy it, but his father said no and ordered him to go to bed. As he was going to bed, the boy saw a five-dollar bill on the counter, which he took and bought the candy.

Here, replacing the punctuation might not be the best approach. Either way, I think this one token will not make a significant difference to the model, and therefore I just concatenate instruction and input with a space.

CloseChoice avatar Apr 16 '23 15:04 CloseChoice
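For readers following along, the handling described in the comment above could look roughly like the sketch below. This is not the code from the actual change: the function name `build_prompt` and the contents of `NO_INPUT_MARKERS` are hypothetical, and the real dataset may use other "no input" variants.

```python
from typing import Optional

# Hypothetical list of "no input" variants; the actual dataset may contain others.
NO_INPUT_MARKERS = {"", "<noinput>", "noinput", "no input", "none", "n/a"}


def build_prompt(instruction: str, input_text: Optional[str]) -> str:
    """Combine instruction and input into a single prompt string."""
    instruction = instruction.strip()
    cleaned = (input_text or "").strip()

    # Skip inputs that are just some spelling of "no input".
    if cleaned.lower() in NO_INPUT_MARKERS:
        return instruction

    # Skip inputs whose text already appears in the instruction.
    if cleaned in instruction:
        return instruction

    # Otherwise join with a single space, leaving punctuation untouched.
    return f"{instruction} {cleaned}"
```

With this sketch, `build_prompt("How dense is a given material?", "Steel")` yields `How dense is a given material? Steel`, while an input of `<noinput>` returns the instruction alone.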

Looks good!

I don't think we should replace the punctuation. Maybe just sample the separator from \n\n, \n and space. When I use GPT, I usually use new lines.

jordiclive avatar Apr 16 '23 16:04 jordiclive
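A minimal sketch of the separator sampling suggested above, assuming a uniform choice between the three separators. The names `SEPARATORS` and `join_with_random_separator` are made up for illustration and are not taken from the repository.

```python
import random

# The three separators mentioned in the comment above.
SEPARATORS = ["\n\n", "\n", " "]


def join_with_random_separator(instruction, input_text, rng=None):
    """Join instruction and input with a randomly chosen separator.

    `rng` may be a random.Random instance for reproducible sampling;
    if omitted, the module-level random generator is used.
    """
    rng = rng or random
    separator = rng.choice(SEPARATORS)
    return f"{instruction}{separator}{input_text}"


if __name__ == "__main__":
    seeded = random.Random(0)
    print(repr(join_with_random_separator("Identify the odd one out.",
                                          "Twitter, Instagram, Telegram",
                                          seeded)))
```

Passing a seeded `random.Random` keeps the dataset construction reproducible across runs, which may or may not matter for the actual training setup.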

I like Jordi's proposal. @CloseChoice do you think you could add this? Something similar to summarization.py#L151 (just simpler, for the different whitespace).

andreaskoepf avatar Apr 17 '23 14:04 andreaskoepf

Updated.

CloseChoice avatar Apr 18 '23 04:04 CloseChoice