stanford_alpaca
Is there any particular reason for using `### ` for instruction, input, and response?
I know that in Markdown, `###` marks a section/subsection heading. Is there any relevance to that here?
Edit, to make the question clearer: in the training script, the prompts are formulated as follows:
```python
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}
```
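For concreteness, here is a minimal sketch of how the template gets filled in. The example record is made up; `PROMPT_DICT` is the dict above:

```python
# Hypothetical record, just to show how the placeholders are substituted.
example = {
    "instruction": "Translate the sentence to French.",
    "input": "The weather is nice today.",
}

# Pick the template with an input field and substitute the placeholders.
prompt = PROMPT_DICT["prompt_input"].format(**example)
print(prompt)
# Below is an instruction that describes a task, paired with an input that
# provides further context. Write a response that appropriately completes
# the request.
#
# ### Instruction:
# Translate the sentence to French.
#
# ### Input:
# The weather is nice today.
#
# ### Response:
```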
My question is: why do we need to put the `###` there at all? Is LLaMA trained specifically to detect this token as something special?
ChatGPT, which was used to create this dataset, uses Markdown format in its responses. As you mentioned, `###` represents a level-3 heading.
@adithyab94 do you have any reference document for this?
I am also curious now: what are the other special tokens?
I think the `###` is added during instruction tuning.
GPT-style models were pre-trained on crawled text without consistent sectioning, and I do not think the authors would have put that much effort into preprocessing the text. It must therefore be this repo (Alpaca) that enforces `###` as a section-delimiter token (see the tokenizer sketch below).
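One way to sanity-check this claim is to inspect the tokenizer directly. A hedged sketch, assuming the Hugging Face `transformers` library and using `huggyllama/llama-7b` purely as an example checkpoint name:

```python
from transformers import AutoTokenizer

# Any public LLaMA checkpoint works here; this path is just an example.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# The marker breaks into ordinary sub-word pieces; there is no single
# reserved token for "### Instruction:".
print(tokenizer.tokenize("### Instruction:"))

# The tokenizer's actual special tokens are the usual BOS/EOS/UNK-style
# markers; '###' is not among them.
print(tokenizer.all_special_tokens)
```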
Well, I do know that LLaMA was trained in much the same way. If you have it generate from a random prompt, more often than not you get the training template back (`### Instruction`, `### Input`, `### Response`) rather than a normal output from, say, something like ChatGPT.
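That regurgitation is why downstream scripts often treat `###` as a stop marker when decoding. A minimal sketch of post-processing a completion from an Alpaca-style model (the helper name and example strings here are made up):

```python
def extract_response(completion: str) -> str:
    # Keep only the text after the final "### Response:" marker ...
    answer = completion.split("### Response:")[-1]
    # ... and cut it off at the next "###" header, in case the model
    # keeps going and hallucinates another instruction block.
    return answer.split("###")[0].strip()

completion = (
    "### Response:\nParis is the capital of France.\n\n"
    "### Instruction:\nName the capital of Spain."
)
print(extract_response(completion))  # -> Paris is the capital of France.
```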