
Open-Dialog Chatbots for Learning New Languages [Part 1] | IAmANerd

How to fine-tune the DialoGPT model on a new dataset or language for open-dialog conversational chatbots.

https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html

utterances-bot · Jun 12 '20

Thank you! Both this notebook and the Data Preprocessing colab are incredibly helpful.

virattt · Jun 12 '20

@virattt Glad it was helpful!

ncoop57 · Jun 12 '20

Is there a limit on the length of the conversations?

monisha08041998 · Jun 30 '20

@monisha08041998 It was only trained on conversations at most 9 turns long, so going beyond that may lead to poor results. Also, the maximum number of tokens GPT-2 will consider when predicting new ones is 512, so even if the conversation is longer, it can only use that many tokens of context.
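For example, here is a minimal sketch (assuming the Hugging Face transformers API) of keeping only the most recent tokens of a long history before generating; the 512-token budget and the sample Spanish turns are just illustrative:

```python
# Minimal sketch: truncate a long conversation history to the model's
# context budget before generating the next reply.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

MAX_CONTEXT = 512  # tokens of history the model will actually attend to

# Hypothetical running history: each turn ends with the EOS token.
history = "Hola, ¿cómo estás?" + tokenizer.eos_token + "Muy bien, ¿y tú?" + tokenizer.eos_token

input_ids = tokenizer.encode(history, return_tensors="pt")
# Drop the oldest tokens so the prompt fits inside the context window.
input_ids = input_ids[:, -MAX_CONTEXT:]

reply_ids = model.generate(
    input_ids,
    max_length=input_ids.shape[-1] + 50,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, not the echoed history.
print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```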

ncoop57 · Jun 30 '20

How are the pretrained weights going to help here, since the new data is completely different?

bhuvan1643 · Sep 15 '20

How did you use the pretrained tokenizer here? The pretrained one contains only English words, but the data here is Spanish.

bhuvan1643 · Sep 15 '20

@bhuvan1643 DialoGPT used the original GPT-2 model, pretrained weights, and tokenizer. Even though the vast majority of that data was English, it still contained some Spanish text and therefore the necessary Spanish characters/words. On top of that, GPT-2's tokenizer is a byte-level BPE, so it can encode any text at all, just less compactly for languages it saw rarely.

I am not 100% sure the pretrained weights help with modeling the Spanish language. However, Spanish has a lot of overlap in vocabulary and grammatical structure with English, largely because English borrowed heavily from French and Latin, which are Romance languages like Spanish. This overlap may help the model transfer its knowledge from English to Spanish.

I'm not sure how well this would work on non-Romance languages like Chinese, Hindi, etc., since there is almost no overlap even if you converted the words/characters to their Latin versions.
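As a quick sanity check, here is a sketch (Hugging Face transformers assumed) of the pretrained GPT-2 tokenizer on Spanish text; byte-level BPE never fails on unseen characters, it just falls back to smaller pieces:

```python
# Minimal sketch: the pretrained GPT-2 byte-level BPE tokenizer can encode
# any Unicode string, but splits unfamiliar languages into more subwords.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["The weather is nice today.", "El clima está agradable hoy."]:
    tokens = tokenizer.tokenize(text)
    print(len(tokens), tokens)
# The Spanish sentence typically splits into more pieces than the English
# one, which is the cost of reusing an English-heavy vocabulary.
```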

ncoop57 · Sep 15 '20

Where did you train the large model? Did you use a cloud service or something like that?

TheHmmka · Oct 16 '20

@TheHmmka I trained the larger model on one of my school's machines, which had four 1080 Tis. I'm sure you could train it on a cloud service relatively easily, though I've never used one.
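For reference, this isn't how that run was launched (the thread doesn't say), but a minimal single-machine multi-GPU sketch with torch.nn.DataParallel looks like:

```python
# Hypothetical single-machine multi-GPU setup. DistributedDataParallel
# scales better, but DataParallel is the smallest change to a training loop.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)  # replicate across all visible GPUs
model.to("cuda")
# From here, batches moved to "cuda" are split across the GPUs automatically.
```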

ncoop57 · Jan 28 '21

I cannot download the data (subtitles of Spanish TV shows), and the script to generate a CSV cannot be accessed either. Can you please fix them? Thanks

etrigger · Jan 29 '21

Hey @etrigger, could you show me the error you are getting when trying to download or generate the data? I tried to reproduce this, but it was working for me.

ncoop57 · Jan 29 '21

I can download the data now, but the script can't be opened: https://colab.research.google.com/drive/1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS?usp=sharing

Here is the error:

https://drive.google.com/drive/?action=locate&id=1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS&authuser=0
A network error occurred and the request could not be completed.
GapiError: A network error occurred and the request could not be completed.
    at pz.Vs [as constructor] (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:704:150)
    at new pz (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:1225:318)
    at Da.program_ (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:1359:470)
    at Fa (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:19:336)
    at Da.throw_ (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:18:402)
    at Ia.throw (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:20:248)
    at g (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:62:155)

etrigger · Jan 30 '21

@etrigger what an interesting error. I did a bit of digging and it seems to be an issue with Colab in certain situations. Here is an issue about it: https://github.com/googlecolab/colabtools/issues/1771, though it seems like it just automagically got fixed for the person who opened it. I'd recommend trying a different browser, or incognito mode in the browser you are using, to see if that fixes it. I don't think there is anything I can do from my side other than giving you access to a converted Python script so you can download the data yourself. Here is a link where you can download the file and run it locally if you want (though be careful: it takes a lot of compute, networking, and memory to generate the CSV, especially for languages that have a ton of examples): https://drive.google.com/file/d/1qvIh3zztJT7TelMYLdahOoGmypw398VD/view?usp=sharing

ncoop57 · Jan 30 '21

@ncoop57 Thanks for the script.

etrigger · Feb 03 '21

@ncoop57 Question on preparing the training data format: I have dialog data where each line has sentence A (source) followed by sentence B (target). How should I organize the data for training?

etrigger · Feb 03 '21

@etrigger The format that DialoGPT requires is in the data section of my blog: https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html#The-Data!. I recommend first getting your data into the format my code expects (each column holding a different response) and then passing it into that function to generate the necessary input data for your model.
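To make that concrete, here is a minimal sketch of the flattening step; the column names and the two-turn example are hypothetical, and the function mirrors the idea of joining turns with the EOS token rather than reproducing the blog's exact code:

```python
# Minimal sketch: turn source/target sentence pairs into single training
# strings, with each turn terminated by the EOS token so the model learns
# where one speaker stops and the next begins.
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

# Hypothetical two-column layout matching "sentence A then sentence B" data.
df = pd.DataFrame({
    "context": ["Hi, how are you?", "What time is it?"],
    "response": ["I'm fine, thanks!", "It's noon."],
})

def flatten_row(row):
    # Oldest turn first, each turn followed by EOS.
    return row["context"] + tokenizer.eos_token + row["response"] + tokenizer.eos_token

train_texts = df.apply(flatten_row, axis=1).tolist()
print(train_texts[0])
```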

ncoop57 · Feb 03 '21

I have a question about the defined train and evaluate functions. Both have inputs, labels = (batch, batch), meaning that inputs and labels are exactly the same. My question is: shouldn't the model try to learn how to respond to the given input? I feel like there is something wrong here.

berkozg96 · Jul 17 '21

Passing the same tensor as both inputs and labels is actually standard for causal language-model fine-tuning. Inside the model (e.g., Hugging Face's GPT-2 implementation), the labels are shifted one position to the right, so the loss at position i measures how well the model predicts token i+1 given everything up to i. Since each training example is a whole conversation with the turns concatenated and separated by EOS tokens, the model is still learning to generate the response from the preceding context; it never gets to see the token it is asked to predict.

If no shift happened, that is, if the model were literally trained to reproduce its input at the same positions, it would just learn the identity mapping and memorize the training data. The internal shift is what turns "inputs equal labels" into next-token prediction.
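A minimal sketch (assuming the Hugging Face transformers API) of that internal shift in action; the loss below is the standard next-token prediction loss even though inputs and labels are the same tensor:

```python
# Minimal sketch: causal LM loss with labels == inputs. Hugging Face's
# GPT-2 head shifts the labels internally, so position i is scored on
# predicting token i+1, never on copying token i.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

text = "Hi, how are you?" + tokenizer.eos_token + "I'm fine, thanks!" + tokenizer.eos_token
batch = tokenizer(text, return_tensors="pt").input_ids

outputs = model(batch, labels=batch)  # same tensor for inputs and labels
print(outputs.loss)  # next-token prediction loss over the whole conversation
```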

Viile1 · Feb 08 '23