DialoGPT
Tokens in multi-turn setting
Hi, thanks for making the work available and for the explanations.
From the paper I understand that a training instance is a dialogue session, made up of several dialogue turns concatenated and ended by the end-of-text token.
Based on this and on what dreasysnail says in Issue #17:
There ARE special tokens (<|endoftext|>, id=50256) between dialogue turns in the multi-turn setup. Your input format should be like this:
Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN
my question is:
are the tokens between different dialogue turns the same as the tokens separating whole dialogue sessions?
Thank you
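For what it's worth, the quoted single-string format can be sketched in plain Python. `join_turns` is just an illustrative helper, not part of the DialoGPT repo:

```python
EOS = "<|endoftext|>"  # id 50256 in the GPT-2 vocabulary

def join_turns(turns):
    """Concatenate dialogue turns with the end-of-text separator."""
    return f" {EOS} ".join(turns)

print(join_turns(["Turn1", "Turn2", "Turn3"]))
# Turn1 <|endoftext|> Turn2 <|endoftext|> Turn3
```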
If I understand right, there are NO tokens between dialogue sessions, because one dialogue session is one training example: it contains a source (utt1 <|eos|> utt2 <|eos|> utt3) and a target (utt4). The next session is passed to the model as another training sample.
Thank you liethman, this is very helpful.
My current, updated understanding is that the .tsv file must be in the format you described, with a \t between the source (utt1 <|eos|> utt2 <|eos|> utt3) and the target (utt4).
Then prepro.py will create the features, which end with an <|endoftext|> token (id=50256).
I'm interested in this too.
I successfully managed to fine-tune the model with input data in this form: each line of the .tsv file is a dialogue, with the turns separated by <|eos|> and a tab separating the target from the rest of the dialogue.
A sample training instance is therefore: utt1 <|eos|> utt2 <|eos|> utt3 \t target \n
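As a sketch of that line format (`make_training_line` is a hypothetical helper, not something from the repo):

```python
def make_training_line(context_turns, target):
    """Format one dialogue as a single .tsv line:
    utt1 <|eos|> utt2 <|eos|> utt3 \t target \n"""
    source = " <|eos|> ".join(context_turns)
    return source + "\t" + target + "\n"

print(repr(make_training_line(["utt1", "utt2", "utt3"], "target")))
# 'utt1 <|eos|> utt2 <|eos|> utt3\ttarget\n'
```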
Hi, @ferdinando17. I am trying to fine-tune the model on my own dataset. I failed to run python demo.py --data small, so I can't see the exact format of the .tsv file. After reading some of the code, I agree with your opinion. Could you please help me confirm whether the format of my dataset (.tsv file) is correct:
0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 \t 1.0 i am a admin .\n
Hope to get your reply. Thanks.
Hi, you are missing the tab; it should be "0.0 utt1 0.0 EOS utt2 0.0 EOS utt3 \t 1.0 i am a admin .\n"
to ask DialoGPT to predict "i am a admin .". Look at my example.
Also, the zeros mean you are not training on the utterances that follow them; is that what you want?
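Assuming the weighted format discussed in this thread (a float weight before each utterance, utterances joined by " EOS "), a small illustrative parser (not code from the repo) could pull the weights apart like this:

```python
def parse_weighted_source(source):
    """Split a weighted source string into (weight, utterance) pairs.

    Assumes each segment starts with a float weight and that
    segments are joined by ' EOS ', e.g. '0.0 utt1 EOS 1.0 utt2'.
    """
    pairs = []
    for segment in source.split(" EOS "):
        weight, text = segment.split(" ", 1)
        pairs.append((float(weight), text))
    return pairs

line = "0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3\t1.0 i am a admin ."
source, target = line.split("\t")
print(parse_weighted_source(source))
# [(0.0, 'utt1'), (1.0, 'utt2'), (1.0, 'utt3')]
```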
Hi, @ferdinando17, this is what bothers me. In multi-turn dialogue we have several previous turns as context, one user turn as the question, and one system turn as the answer. From your explanation I realized that it should be
0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 \t 1.0 i am a admin .\n
as the example in the training/fine-tuning dataset, where only the first sentence gets 0.0 and all remaining sentences get 1.0, regardless of whether they are user or system turns. (Actually I am confused about whether I should distinguish between user and system turns, giving 0.0 to user turns and 1.0 to system turns, so that the model only needs to predict the system turns, since at evaluation time the model only has to predict the system utterance. But maybe all 1.0 helps train the model with more data.) Is that correct? Hope to get your reply. Thanks. 🙏
Are you applying it to task-oriented dialogue?
I understand that the 0.0 weights are for sentences you want to filter out; the authors used them to avoid training on offensive language. I used all 1.0, and my training instances were of the form I specified, where the target was always a system turn.
I hope it makes sense.
Hi, @ferdinando17. Thank you for your reply. Yes, I am trying to apply it to task-oriented dialogue. In my understanding, 0.0 keeps the model from being trained to predict that sentence, while 1.0 makes the model learn to predict it. So I think it is fine to train the model by giving the first sentence of each multi-turn dialogue a 0.0, as context, and the remaining sentences a 1.0. Alternatively, we can give every user turn a 0.0 and every system turn a 1.0. Maybe experiments with both settings are needed. Thanks again for your reply.
Ok, I see. I disagree, but of course I might be wrong. In this issue, another user says 0.0 causes the sentences to be ignored during training. They refer to the Hugging Face docs too.
Let me know if you find evidence to the contrary.
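If the 0.0 weights are indeed implemented by excluding those utterances from the loss, the usual PyTorch / Hugging Face mechanism is to set their label positions to -100, which cross-entropy ignores. A minimal sketch of that idea (build_labels is hypothetical, not the repo's actual code):

```python
IGNORE_INDEX = -100  # label value skipped by torch cross-entropy loss

def build_labels(token_ids, weights):
    """token_ids: one list of token ids per utterance;
    weights: per-utterance 0.0/1.0 flags.
    Returns a flat label list where 0.0-weight utterances are masked out,
    so they contribute nothing to the training loss."""
    labels = []
    for ids, w in zip(token_ids, weights):
        labels.extend(ids if w > 0 else [IGNORE_INDEX] * len(ids))
    return labels

print(build_labels([[10, 11], [20]], [0.0, 1.0]))
# [-100, -100, 20]
```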
Hi guys, how do I deal with a dataset like this: person1: utt1, person2: utt2, person1: utt3 ...? Referring to what you all said, I think it should look like this: 1.0 utt1 EOS 1.0 utt2 EOS \t 1.0 utt3
Is this correct?
Also, I'm wondering what the validation set should look like.