Open-Assistant
add alpaca gpt4 dataset
The `input` column contains many different variants of "no input", so we don't use the `input` column in those cases.
In some cases the text in `input` is already contained in the instruction; in these cases we also don't use the `input` column.
I am not quite sure how to concatenate the `instruction` and the `input` columns. In most cases it seems fine to just replace the last appearance of `.`, `!` or `?` with a colon, e.g.:
Instruction: Identify the odd one out.
Input: Twitter, Instagram, Telegram
or
Instruction: How dense is a given material?
Input: Steel
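For illustration, the colon replacement could look roughly like the sketch below. The function name `join_with_colon` is hypothetical and not part of this PR; it only shows the idea of swapping the last sentence-ending character for a colon before appending the input.

```python
def join_with_colon(instruction: str, input_text: str) -> str:
    """Replace the last '.', '!' or '?' in the instruction with ':' and append the input.

    Hypothetical helper for illustration only, not code from this PR.
    """
    # rfind returns -1 when the character is absent, so max() yields the
    # position of the last sentence-ending character, or -1 if there is none.
    idx = max(instruction.rfind(ch) for ch in ".!?")
    if idx == -1:
        # No punctuation to replace: fall back to a plain space join.
        return f"{instruction} {input_text}"
    return f"{instruction[:idx]}:{instruction[idx + 1:]} {input_text}"


print(join_with_colon("Identify the odd one out.", "Twitter, Instagram, Telegram"))
print(join_with_colon("How dense is a given material?", "Steel"))
```

On the two examples above this gives "Identify the odd one out: Twitter, Instagram, Telegram" and "How dense is a given material: Steel".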
But we also have some questions like:
Instruction: Given the following synopsis, what is the moral lesson of this story?
Input: Once upon a time, there was a poor young boy who wanted some candy. He begged his father for money to buy it, but his father said no and ordered him to go to bed. As he was going to bed, the boy saw a five-dollar bill on the counter, which he took and bought the candy.
Here this might not be the best approach. Either way, I think this one token will not make a significant difference to the model, so I just concatenate `instruction` and `input` with a space.
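Put together, the handling described above could be sketched as follows. This is a minimal sketch, not the code in this PR: `build_prompt` and the `NO_INPUT_VARIANTS` set are assumptions made for illustration, and the real dataset may contain "no input" spellings that are not listed here.

```python
# Hypothetical list of "no input" spellings; the actual variants in the
# dataset are not enumerated in this PR description.
NO_INPUT_VARIANTS = {"", "no input", "noinput", "<no input>", "n/a", "none"}


def build_prompt(instruction: str, input_text: str) -> str:
    """Combine an instruction and its input into one prompt string."""
    cleaned = input_text.strip()
    # Case 1: the input is just some variant of "no input" -> drop it.
    if cleaned.lower().rstrip(".") in NO_INPUT_VARIANTS:
        return instruction
    # Case 2: the input text already appears in the instruction -> drop it.
    if cleaned in instruction:
        return instruction
    # Otherwise: plain concatenation with a single space.
    return f"{instruction} {cleaned}"


print(build_prompt("How dense is a given material?", "Steel"))
print(build_prompt("Name a fruit.", "No input."))
```

The first call returns the instruction followed by a space and the input; the second returns the instruction unchanged because the input is treated as empty.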
Looks good!
I don't think we should replace the grammar. Maybe just sample from `\n\n`, `\n` and a space. I mean, when I use GPT I usually use newlines.
I like Jordi's proposal. @CloseChoice do you think you could add this? .. similar to summarization.py#L151 (just simpler for the different whitespace ..)
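A rough sketch of the whitespace sampling proposed above. This is not the code at summarization.py#L151 nor the change that was eventually pushed; the names `SEPARATORS` and `join_with_random_separator` are hypothetical, and uniform sampling is an assumption.

```python
import random

# Candidate separators from the proposal: blank line, newline, single space.
SEPARATORS = ["\n\n", "\n", " "]


def join_with_random_separator(instruction: str, input_text: str, rng=None) -> str:
    """Join instruction and input with a uniformly sampled whitespace separator.

    Pass a seeded random.Random as `rng` for reproducible sampling;
    otherwise the module-level generator is used.
    """
    rng = rng or random
    return f"{instruction}{rng.choice(SEPARATORS)}{input_text}"


example = join_with_random_separator(
    "Identify the odd one out.",
    "Twitter, Instagram, Telegram",
    rng=random.Random(0),
)
print(repr(example))
```

Sampling the separator per example leaves the punctuation of the instruction untouched while still exposing the model to the different ways users typically separate a task from its input.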
Updated.