
EOS token question for multi-round conversations in OASST

grimulkan opened this issue 1 year ago · 1 comment

I'm trying to figure out whether the format for multi-round conversations appends an EOS token at the end of each assistant reply in the history, or none at all (leaving it to be added by the new response), and whether this varies by model.

It is a general question (so I'm using the generic tokens from the web UI in the examples below), but I'm specifically wondering about the recent Open Assistant models, because I don't think previous multi-round training runs (e.g., Vicuna) did it the same way.

Case 1: Input to model (with history) is:

<|user|>Question 1<|endoftext|>
<|bot|>Answer 1<|endoftext|>
<|user|>Question 2<|endoftext|>
<|bot|>Answer 2<|endoftext|>
<|user|>Question 3<|endoftext|>

Response will be:

<|bot|>Answer 3<|endoftext|> EOS_TOKEN

Case 2: Input is:

<|user|>Question 1<|endoftext|>
<|bot|>Answer 1<|endoftext|> EOS_TOKEN
<|user|>Question 2<|endoftext|>
<|bot|>Answer 2<|endoftext|> EOS_TOKEN
<|user|>Question 3<|endoftext|>

Response will be (same as Case 1):

<|bot|>Answer 3<|endoftext|> EOS_TOKEN
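
For concreteness, here's a minimal Python sketch of the two assembly strategies. All the names here (build_prompt, the case flag, eos_token) are hypothetical and just mirror my examples above; this is not actual chat.py code:

def build_prompt(history, new_question, eos_token="</s>", case=1):
    # history is a list of (question, answer) pairs
    parts = []
    for question, answer in history:
        parts.append(f"<|user|>{question}<|endoftext|>")
        turn = f"<|bot|>{answer}<|endoftext|>"
        if case == 2:
            # Case 2: the real EOS token also follows every past assistant reply
            turn += eos_token
        parts.append(turn)
    parts.append(f"<|user|>{new_question}<|endoftext|>")
    return "\n".join(parts)

In both cases the model's new reply is expected to end with EOS; the only difference is whether the replies already in the history carry one too.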

Looking at chat.py, it looks like Case 1 is always assumed in the web UI. However, looking at the format_pairs function (line 152 at the time of this writing) in https://github.com/LAION-AI/Open-Assistant/blob/5ce45f31ebf0a1c7b389b88c9755aea8393e8f9f/model/model_training/custom_datasets/formatting.py, I'm wondering if Open Assistant was trained with Case 2 instead.
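
For reference, the logic of format_pairs is roughly the following. This is paraphrased from memory of the linked file, so treat it as a sketch rather than a verbatim copy (QA_SPECIAL_TOKENS maps to OA's <|prompter|> and <|assistant|> tokens):

QA_SPECIAL_TOKENS = {"Question": "<|prompter|>", "Answer": "<|assistant|>"}

def format_pairs(pairs, eos_token, add_initial_reply_token=False):
    # pairs alternates user/assistant text; eos_token is appended after
    # every turn, assistant replies in the history included
    conversations = [
        QA_SPECIAL_TOKENS["Question" if i % 2 == 0 else "Answer"] + pairs[i] + eos_token
        for i in range(len(pairs))
    ]
    if add_initial_reply_token:
        # optionally end the prompt with the bare assistant token
        conversations.append(QA_SPECIAL_TOKENS["Answer"])
    return conversations

The fact that eos_token follows every turn, including past assistant replies, is what makes me suspect Case 2.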

Does anyone know for sure?

grimulkan · Apr 30 '23 00:04

Adding to that (a related question): it looks like the web UI actually inputs the following format in instruct mode, which is slightly different from my case examples in that the extra <|bot|> prompt is part of the input, not the output:

<|user|>Question 1<|endoftext|>
<|bot|>Answer 1<|endoftext|>
<|user|>Question 2<|endoftext|>
<|bot|>Answer 2<|endoftext|>
<|user|>Question 3<|endoftext|>
<|bot|>

whereas in the Open Assistant training script, the loss-function mask includes the assistant token, so they are expecting something closer to my case examples in the previous post. Dunno if that matters either.
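
To illustrate why the mask matters, here's a hypothetical sketch of the label-building step (none of these names come from the OA training script): whether the assistant prefix token sits inside the loss mask determines whether the model is trained to emit <|bot|> itself or expects it to be pre-filled, as the web UI does.

import torch

def build_labels(input_ids, assistant_spans, include_prefix=True, ignore_index=-100):
    # Start with everything masked out of the loss
    labels = torch.full_like(input_ids, ignore_index)
    for start, end in assistant_spans:  # token-index ranges of assistant turns
        if not include_prefix:
            start += 1  # skip the <|bot|>/<|assistant|> token itself
        labels[start:end] = input_ids[start:end]  # unmask assistant tokens
    return labels

If the prefix is inside the mask (include_prefix=True), generating <|bot|> is part of what the model learned, so pre-filling it at inference is a slight train/inference mismatch.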

I wish this stuff were more standardized. We're getting there...

grimulkan · Apr 30 '23 01:04

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] · May 30 '23 23:05