ParlAI How to train for language modeling

Hi! I have a question about tasks.

Does a language modeling task already exist in the ParlAI task list? I'm planning to train a 90M transformer generator from scratch on new data, and I first want to train a language model so that it can handle a new language. I would like to know if I can reuse an existing LM task, and I'm also curious how the reddit task was made.

If you could help me, I would really appreciate it. Thank you for your work.

Jul 22 '22 09:07 gitsand996

At the moment, I don't believe there are any pure language-modeling tasks within ParlAI. We'd be open to accepting additional datasets.

Jul 22 '22 17:07 klshuster

@klshuster What is the difference between a Blended Skill Talk task and a pure LM task? The objective is to predict the next sentence conditioning only on the previous sentence, right?

Jul 25 '22 07:07 gitsand996

The dialogue tasks are a subset of the LM tasks. When we talk about general LM tasks, it could be anything: writing an article, summarizing it, or predicting the next utterance in a dialogue. The Blended Skill Talk task is for further fine-tuning your models on dialogue.
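
To make the "subset" point concrete, here is a minimal sketch of how both kinds of data can be reduced to the same (context, target) shape. The function names, the word-based window sizes, and the choice to use the full dialogue history as context are assumptions made for illustration. None of this is ParlAI code.

```python
from typing import List, Tuple

# Both kinds of data end up as (context, target) pairs.
Example = Tuple[str, str]


def lm_examples(document: str, context_words: int = 8, target_words: int = 4) -> List[Example]:
    """Slice a plain document into non-overlapping (context, continuation) pairs.

    Hypothetical helper: splitting on whitespace stands in for a real tokenizer.
    """
    words = document.split()
    step = context_words + target_words
    examples = []
    for start in range(0, len(words) - step + 1, step):
        context = " ".join(words[start:start + context_words])
        target = " ".join(words[start + context_words:start + step])
        examples.append((context, target))
    return examples


def dialogue_examples(turns: List[str]) -> List[Example]:
    """Pair each utterance with all the turns that precede it.

    Assumption: the full history is the context. Keeping only the
    previous turn would be a stricter variant of the same idea.
    """
    return [("\n".join(turns[:i]), turns[i]) for i in range(1, len(turns))]
```

Seen this way, a dialogue example is just an LM example whose context happens to be a conversation history and whose target happens to be the next utterance.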

Jul 25 '22 14:07 mojtaba-komeili

Fine, so if, for example, I want to train an LM from scratch, I have to implement a task for causal language modeling, such that the model has to generate a comment based on the previous post/comment, and this training has to be unsupervised. Is that right?

Aug 05 '22 08:08 gitsand996

I didn't exactly get your point. But technically, if your unsupervised dataset is large enough and has enough conversations, that should be enough. That often needs A LOT of data, though. That's why extra fine-tuning on conversation datasets helps.
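
As a rough sketch of what preparing such a dataset could look like: the snippet below turns (post, reply) pairs into lines in ParlAI's tab-separated text format (`text:`, `labels:`, `episode_done:` fields). The helper names are hypothetical, each pair is treated as its own single-turn episode, and whitespace is flattened as a simplification rather than escaped the way ParlAI's own utilities would do it.

```python
from typing import Iterable, List, Tuple


def _clean(text: str) -> str:
    # Tabs separate fields and newlines separate examples in this format,
    # so collapse all whitespace runs to single spaces (a simplification).
    # Text containing '|' may need extra handling, since that character
    # separates multiple labels in the format, as far as I know.
    return " ".join(text.split())


def to_parlai_lines(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Format (post, reply) pairs as one single-turn episode per line."""
    lines = []
    for post, reply in pairs:
        lines.append(
            f"text:{_clean(post)}\tlabels:{_clean(reply)}\tepisode_done:True"
        )
    return lines


def write_parlai_file(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    """Write the formatted lines to a UTF-8 text file, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in to_parlai_lines(pairs):
            f.write(line + "\n")
```

If I remember the docs correctly, a file like this can be loaded without writing a custom teacher by passing `--task fromfile:parlaiformat --fromfile-datapath <path>`, but please double-check the exact flags against the current ParlAI documentation.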

Aug 05 '22 14:08 mojtaba-komeili

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

Sep 05 '22 00:09 github-actions[bot]