transfer-learning-conv-ai

Training on my own data/dialogues: Understanding the dataset format used by the code here

Pranav-Goel opened this issue 5 years ago • 19 comments

Hello,

This is with respect to the dataset file being used by the code here at https://s3.amazonaws.com/datasets.huggingface.co/personachat/personachat_self_original.json.

Can anyone tell me what the "candidate" utterances are? I could not find a description. Are they simply negative samples, i.e., random utterances from other dialogues? Also, is it correct that the last utterance in the "candidates" list is always the true response that follows the chat "history" utterances?

Thanks!
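(For anyone else digging into this: a quick way to inspect the raw file is a minimal sketch like the following, assuming the JSON has been downloaded locally.)

import json

# Load the raw PersonaChat file (downloaded from the S3 URL above).
with open("personachat_self_original.json") as f:
    data = json.load(f)

print(data.keys())            # dict_keys(['train', 'valid'])
dialog = data["train"][0]
print(dialog["personality"])  # persona sentences for this dialogue
utt = dialog["utterances"][0]
print(len(utt["candidates"])) # number of candidate replies for this turn
print(utt["candidates"][-1])  # the last candidate is the gold response
print(utt["history"])         # the dialogue turns so far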

Pranav-Goel avatar Jun 21 '19 23:06 Pranav-Goel

Hi Pranav, did you manage to understand the dataset format? I am also facing the same issue.

nikhiljaiswal avatar Jul 29 '19 08:07 nikhiljaiswal

I think I understand it: candidates is a list of possible responses, the last of which is the true response in your dialogue. So if candidates is ["Thank you", "You're welcome"],

get_data_loaders will generate one example with "Thank you" as the response and a multiple-choice label of False, and another example with "You're welcome" and a multiple-choice label of True.

Note that only the last args.num_candidates candidates are used.
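In code terms, roughly (an illustrative sketch of the idea, not the repo's actual get_data_loaders; the names here are made up):

def build_examples(history, candidates, num_candidates=2):
    # Keep only the last num_candidates entries, mirroring how train.py
    # truncates with args.num_candidates; the gold reply, stored last,
    # is therefore always retained.
    candidates = candidates[-num_candidates:]
    examples = []
    for i, reply in enumerate(candidates):
        examples.append({
            "history": history,
            "reply": reply,
            "mc_label": i == len(candidates) - 1,  # True only for the gold reply
        })
    return examples

examples = build_examples(
    history=["Hi, how are you?"],
    candidates=["Thank you", "You're welcome"],
)
# -> one example with reply "Thank you" (mc_label False) and
#    one with reply "You're welcome" (mc_label True)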

sshleifer avatar Jul 29 '19 18:07 sshleifer

thanks @sshleifer , it helped

nikhiljaiswal avatar Jul 31 '19 04:07 nikhiljaiswal

Hi @Pranav-Goel, did your network train well at answering custom questions? I have replicated the data format with my own questions and answers, but it is not working well at answering questions for me. Can you give your inputs?

Nagakiran1 avatar Aug 01 '19 08:08 Nagakiran1

Hi @Pranav-Goel, did your network train well at answering custom questions? I have replicated the data format with my own questions and answers, but it is not working well at answering questions for me. Can you give your inputs?

On my data, the loss does not decrease... why?

lemon234071 avatar Aug 14 '19 14:08 lemon234071

Does anyone have an example of replicating a similar data structure using their own data?

zbloss avatar Aug 30 '19 00:08 zbloss

Could anyone give a clear explanation of the dataset format?

GraphGrailAi avatar Oct 04 '19 01:10 GraphGrailAi

Just posted example_entry.py

sshleifer avatar Oct 04 '19 03:10 sshleifer

Just posted example_entry.py

Good, I have several questions about the dataset's nature:

  1. As I understand it, you used the Facebook PERSONA-CHAT dataset for demo purposes only, to test the model with many personalities. So if I remove all personalities except one, then add more examples to the "personality" list, "candidates", and "history", and retrain the model, I will reproduce your demo, but with only one (more detailed) personality. Am I right?

  2. So other personalities don't affect each other during training?

  3. As stated in https://github.com/huggingface/transfer-learning-conv-ai/blob/master/example_entry.py, candidates is [next_utterance_candidate_1, ..., next_utterance_candidate_19], and per the comment above (https://github.com/huggingface/transfer-learning-conv-ai/issues/15#issuecomment-516111140) the candidates are all wrong, random utterances except the last one: a list of possible responses, the last of which is the true response in the dialogue. Should the candidates list be completely random in each pair? And is there a limit on the length of the candidates list (or is there none, and the more the better)?

GraphGrailAi avatar Oct 04 '19 13:10 GraphGrailAi

  1. Not really the motive; it was a competition with this data.
  2. Kind of a deep question about backprop :), but they definitely do.
  3. As long as the last utterance is the ground truth, you can experiment however you'd like with distractors.
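For anyone assembling their own data, a hedged sketch of that recipe (make_candidates is a made-up helper, not from the repo): sample distractors from the pool of other utterances and append the gold reply last, since that position is what the training code treats as ground truth.

import random

def make_candidates(gold_reply, utterance_pool, n_distractors=18):
    # Any utterance other than the gold reply is a fair distractor;
    # n_distractors is a free choice (example_entry.py shows 19
    # candidates per turn, i.e. 18 distractors plus the gold reply).
    distractors = random.sample(
        [u for u in utterance_pool if u != gold_reply], n_distractors
    )
    return distractors + [gold_reply]  # ground truth must come last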

sshleifer avatar Oct 05 '19 07:10 sshleifer

[edited by sshleifer to remove email artifacts] Re 2) I think that means, if other personalities affect each other, that the more personalities I use, the better the model. That is counterintuitive at first sight, because it seems like a personality only uses the GPT data to answer a very limited set of replies.

Also, it would be interesting to see how it behaves if I use gpt2-large from https://huggingface.co/transformers/pretrained_models.html

GraphGrailAi avatar Oct 05 '19 09:10 GraphGrailAi

So what is the point of the candidate utterances? What is their purpose?

psyfb2 avatar Feb 21 '20 13:02 psyfb2

Just posted example_entry.py

Hi sshleifer, thanks for this repo. Regarding example_entry.py, how do I train on that file once I've populated it? I'm not seeing where to reference it in the training command, or whether to convert it into some DB or something.

made-by-chris avatar Jun 03 '20 20:06 made-by-chris

So if I edit example_entry.py and add my stuff, do I now need to retrain the model? Or is simply restarting interact.py sufficient?

leahcornelius avatar Sep 02 '20 18:09 leahcornelius

@leocornelius Do you remember how you fixed this? I'm at the same point :-)

albusdemens avatar Nov 08 '20 10:11 albusdemens

You need to have a JSON file with train and valid keys, each holding an array/list of entries.

# try as below; if it works, just populate it with your own data
mydict = {}
mydict["train"] = EXAMPLE_ENTRY
mydict["valid"] = EXAMPLE_ENTRY

So the file should contain:

{
"train" : 
[   {   'personality': [   'I am inventive curious'],
        'utterances': [   {   'candidates': [   "something that bot should not say here",
                                                "also some nonsense that should not be relevant!",
                                                "last line that bot should use as answer'],
                              'history': ['Some history line', 'some other history line'],
                               } 
                            ]
    },
 ],
"valid" : 
[   {   'personality': [   'I am inventive curious'],
        'utterances': [   {   'candidates': [   "something that bot should not say here",
                                                "also some nonsense that should not be relevant!",
                                                "last line that bot should use as answer'],
                              'history': ['Some history line', 'some other history line'],
                               } 
                            ]
    },
 ]
}
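If you assemble the entries in Python, a minimal sketch like this (the filename is just an example) produces valid JSON and avoids hand-written quoting mistakes:

import json

entry = {
    "personality": ["I am inventive curious"],
    "utterances": [{
        "candidates": [
            "something that bot should not say here",
            "also some nonsense that should not be relevant!",
            "last line that bot should use as answer",
        ],
        "history": ["Some history line", "some other history line"],
    }],
}

# json.dump emits double-quoted, strictly valid JSON.
with open("my_dataset.json", "w") as f:
    json.dump({"train": [entry], "valid": [entry]}, f, indent=2)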

gorkemgoknar avatar Dec 18 '20 19:12 gorkemgoknar

Hi @gorkemgoknar @thomwolf, I want to ask about training on different languages, following this article: https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313. It says to start by pretraining a language model on a very large corpus of text, so that it can generate long stretches of contiguous, coherent text, and then to fine-tune this language model to adapt it to the end task: dialog. I want to first train on Turkish Wikipedia data and then fine-tune with Turkish dialogue data. Do you have any suggestions about this? Thanks for your time.

Hilal-Urun avatar Mar 24 '21 14:03 Hilal-Urun

Hi @Hilal-Urun, following Pierre's tutorial on transfer learning at https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787, I was able to generate a Turkish GPT-2 small with a Wiki dump (and one more with books). A running English version of this chatbot using movie scripts is available at https://www.metayazar.com/chatbot, trained on some 20-30 MB of cleaned text conversation data, if anyone is interested in seeing how it looks live (Hugging Face also hosts their base version on their website). A small Turkish model is available in the Hugging Face repository at https://huggingface.co/gorkemgoknar/gpt2-small-turkish/; anyone interested in making their own version should first check whether their language is already available on huggingface.co. Note that I did try a Turkish chatbot, but a few MB of conversation data is not enough for good context output.

gorkemgoknar avatar Mar 24 '21 15:03 gorkemgoknar

Just as a matter of interest, the JSON parser is very pedantic; I had to change the formatting a little:

{ "train" : [ { "personality": [ "I am inventive curious"], "utterances": [ { "candidates": [ "something that bot should not say here", "also some nonsense that should not be relevant!", "last line that bot should use as answer"], "history": ["Some history line", "some other history line"] } ] } ], "valid" : [ { "personality": [ "I am inventive curious"], "utterances": [ { "candidates": [ "something that bot should not say here", "also some nonsense that should not be relevant!", "last line that bot should use as answer"], "history": ["Some history line", "some other history line"] } ] } ] }

Entering the following command then works (within the confines of this very limited personality):

python3 interact.py --dataset_path test_personality.json --dataset_cache ./test_personality_cache

andrejburke avatar Feb 14 '22 06:02 andrejburke