self_dialogue_corpus
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
The Self-dialogue Corpus
This is an early release of the Self-dialogue Corpus containing 24,165 conversations, or 3,653,313 words, across 23 topics. For more information on the data, please see our corpus paper or our submission to the Alexa Prize.
Statistics
| Category | Count |
|---|---|
| Topics | 23 |
| Conversations | 24,165 |
| Words | 3,653,313 |
| Turns | 141,945 |
| Unique users | 2,717 |
| Conversations per user | ~9 |
| Unique tokens | 117,068 |
Topics include movies, music, sports, and subtopics within these.
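The per-user and per-conversation averages follow directly from the counts in the table. A quick check in Python, using only those counts (the variable names are illustrative):

```python
# Counts copied from the statistics table above.
conversations = 24_165
words = 3_653_313
turns = 141_945
unique_users = 2_717

# Simple averages derived from the counts.
conversations_per_user = conversations / unique_users
words_per_conversation = words / conversations
turns_per_conversation = turns / conversations

print(f"Conversations per user: {conversations_per_user:.2f}")
print(f"Words per conversation: {words_per_conversation:.2f}")
print(f"Turns per conversation: {turns_per_conversation:.2f}")
```

This gives roughly 8.9 conversations per user (consistent with the ~9 in the table), about 151 words per conversation, and just under 6 turns per conversation.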
Using the data
- `corpus` contains the raw CSVs from Amazon Mechanical Turk, sorted by individual tasks (topics);
- `blocked_workers.txt` lists workers who did not comply with the requirements of the tasks; these workers are omitted by default;
- `get_data.py` is a preprocessing script that formats the CSVs into text (saved to `dialogues` by default) and offers various options (see below).
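As a rough illustration of what omitting blocked workers could look like, the sketch below filters the rows of a Mechanical Turk results CSV by worker ID. It is not taken from `get_data.py`: the `WorkerId` column name, the one-ID-per-line layout of `blocked_workers.txt`, and the helper names `load_blocked` and `filter_rows` are all assumptions, and the sample data is invented.

```python
import csv
import io


def load_blocked(lines):
    """Return the set of blocked worker IDs, assuming one ID per line."""
    return {line.strip() for line in lines if line.strip()}


def filter_rows(csv_file, blocked, id_column="WorkerId"):
    """Yield rows whose worker ID (assumed column name) is not blocked."""
    for row in csv.DictReader(csv_file):
        if row[id_column] not in blocked:
            yield row


# Invented sample data standing in for blocked_workers.txt and a task CSV.
blocked = load_blocked(io.StringIO("W2\n"))
sample_csv = io.StringIO("WorkerId,Answer\nW1,hello\nW2,spam\nW3,goodbye\n")

kept = list(filter_rows(sample_csv, blocked))
print([row["WorkerId"] for row in kept])
```

With real files, the `io.StringIO` objects would be replaced by file handles opened on `blocked_workers.txt` and on a CSV under `corpus`, after checking the actual column names.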
get_data.py
Example usage: `python get_data.py`. By default, this reads from `corpus` and writes to `dialogues`.
Optional arguments:
- `--inDir`: directory to read the corpus from;
- `--outDir`: directory to write processed files to;
- `--output-naming`: whether to name output files with integers (`integer`) or by assignment ID (`assignment_id`);
- `--remove-punctuation`: removes punctuation from the output;
- `--set-case`: sets the case of the output to `original`, `upper` or `lower`;
- `--exclude-topic`: excludes any of the topics (or subdirectories of `corpus`), e.g. `--exclude-topic music`;
- `--include-only`: includes only the given topics, e.g. `--include-only music`.
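When driving the script from other Python code, it can help to assemble the command line programmatically. The sketch below builds an argument list from the options listed above; `build_command` is a hypothetical helper, not part of the repository, and it assumes that `--remove-punctuation` is a plain switch and that the other options each take a single value, as in the examples above.

```python
def build_command(in_dir=None, out_dir=None, output_naming=None,
                  remove_punctuation=False, set_case=None,
                  exclude_topic=None, include_only=None):
    """Assemble an argument list for get_data.py (hypothetical helper)."""
    cmd = ["python", "get_data.py"]
    # Options that take a value; skipped when left as None.
    valued_options = [
        ("--inDir", in_dir),
        ("--outDir", out_dir),
        ("--output-naming", output_naming),
        ("--set-case", set_case),
        ("--exclude-topic", exclude_topic),
        ("--include-only", include_only),
    ]
    for flag, value in valued_options:
        if value is not None:
            cmd.extend([flag, value])
    # Assumed to be a switch with no value.
    if remove_punctuation:
        cmd.append("--remove-punctuation")
    return cmd


# Lowercased, punctuation-free output for the music topic only.
cmd = build_command(set_case="lower", include_only="music",
                    remove_punctuation=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, where `get_data.py` and `corpus` are located.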
Citation
For research using this data, please cite:
@article{fainberg2018talking,
title={Talking to myself: self-dialogues as data for conversational agents},
author={Fainberg, Joachim and Krause, Ben and Dobre, Mihai and Damonte, Marco and Kahembwe, Emmanuel and Duma, Daniel and Webber, Bonnie and Fancellu, Federico},
journal={arXiv preprint arXiv:1809.06641},
year={2018}
}
@article{krause2017edina,
title={Edina: Building an Open Domain Socialbot with Self-dialogues},
author={Krause, Ben and Damonte, Marco and Dobre, Mihai and Duma, Daniel and Fainberg, Joachim and Fancellu, Federico and Kahembwe, Emmanuel and Cheng, Jianpeng and Webber, Bonnie},
journal={Alexa Prize Proceedings},
year={2017}
}