Open-Assistant
Proposal: change the conversation template
I propose removing the added special tokens <prefix>, </prefix>, <human>, and <bot> and replacing them with the model's default eos (end-of-sentence) and bos (beginning-of-sentence) tokens.
The replacements are as follows:
<human> -> <bos>user\n
<bot> -> <eos>\n<bos>assistant\n
<prefix> -> <bos>
</prefix> -> <eos>
where <bos> and <eos> represent the bos and eos tokens of the pretrained model. If this looks familiar, it is the same format ChatGPT uses.
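To make the mapping concrete, here is a minimal sketch that rewrites a string in the old format into the proposed one by plain string replacement. The function name `convert_legacy_template` and the example `<s>` / `</s>` token strings are assumptions for illustration only; the real bos and eos strings depend on the pretrained model's tokenizer.

```python
def convert_legacy_template(text: str, bos: str, eos: str) -> str:
    """Rewrite the old special tokens using the mapping proposed above.

    Hypothetical helper: `bos` and `eos` must be supplied by the caller,
    since they differ between pretrained models.
    """
    replacements = {
        "</prefix>": eos,
        "<prefix>": bos,
        "<human>": f"{bos}user\n",
        "<bot>": f"{eos}\n{bos}assistant\n",
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


# Example with placeholder token strings (assumed, not taken from any model).
legacy = "<prefix>You are helpful.</prefix><human>Hi<bot>Hello!"
converted = convert_legacy_template(legacy, bos="<s>", eos="</s>")
print(repr(converted))
```

Note that under this mapping the final assistant reply is not closed by an eos token unless one is appended separately.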
Reason to replace:
We would need to train these new embeddings, and we have no way of knowing whether they actually work as separators between conversation turns (some attention visualization might help?). The eos and bos tokens, by contrast, have already been updated over millions of iterations and might require less tuning to work well.
Reason not to replace:
The dedicated special tokens save a few tokens of space.
Implementation-wise this would be very easy. I would personally train an SFT model on the new format with the OASST dataset and compare the two, since it is a very small dataset and we can't afford that many update iterations without overfitting.
<human> -> <bos>user\n
<bot> -> <eos>\n<bos>assistant\n
Although technically we only need two tokens, I would suggest structuring this a bit more by defining SeqToken and ChatRole enums:
from enum import Enum

class SeqToken(str, Enum):
    begin = "<|startoftext|>"
    end = "<|endoftext|>"
    delimiter = "\n"
This would be combined with three roles:
class ChatRole(str, Enum):
    system = "system"
    prompter = "prompter"
    assistant = "assistant"
The sequence could be built up as an array per message and then joined on the delimiter:
prefix = "".join([SeqToken.begin, ChatRole.system, "prefix...", SeqToken.end])
prompter_msg = "".join([SeqToken.begin, ChatRole.prompter, "prompt here...", SeqToken.end])
assistant_msg = "".join([SeqToken.begin, ChatRole.assistant, "reply here...", SeqToken.end])
conversation = SeqToken.delimiter.join([prefix, prompter_msg, assistant_msg])
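For reference, the pieces above can be assembled into one self-contained sketch. The helpers `format_message` and `format_conversation` are hypothetical names, not part of the Open-Assistant code base, and the newline inserted after the role name is an assumption borrowed from the `<bos>user\n` mapping in the original proposal; the snippet above concatenates role and text directly.

```python
from enum import Enum
from typing import Iterable, Tuple


class SeqToken(str, Enum):
    begin = "<|startoftext|>"
    end = "<|endoftext|>"
    delimiter = "\n"


class ChatRole(str, Enum):
    system = "system"
    prompter = "prompter"
    assistant = "assistant"


def format_message(role: ChatRole, text: str) -> str:
    # Hypothetical helper: wraps one message as <begin>role\ntext<end>.
    # The "\n" after the role is an assumption (see lead-in).
    return "".join([SeqToken.begin.value, role.value, "\n", text, SeqToken.end.value])


def format_conversation(messages: Iterable[Tuple[ChatRole, str]]) -> str:
    # Join the individually wrapped messages on the delimiter token.
    return SeqToken.delimiter.value.join(
        format_message(role, text) for role, text in messages
    )


conversation = format_conversation([
    (ChatRole.system, "You are a helpful assistant."),
    (ChatRole.prompter, "Hi"),
    (ChatRole.assistant, "Hello!"),
])
print(conversation)
```

Using `.value` explicitly keeps the output independent of how str-mixin enum members are rendered by `str()` or `format()`, which has varied between Python versions.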
We are currently using the v2.5 conversation template, which is similar to the template discussed here but without begin-sequence and delimiter tokens.