datasets for NLP research

Datasets

dialog system

  • Weibo (zh): https://ai.tencent.com/ailab/nlp/dialogue/#datasets (Weibo Conversation Datasets)

  • Douban (zh): https://github.com/MarkWuNLP/MultiTurnResponseSelection

  • Douban-20k (zh): https://ai.tencent.com/ailab/nlp/dialogue/#datasets (Restoration-200K datasets)

  • Weibo Emotional Conversation Dataset (zh): http://coai.cs.tsinghua.edu.cn/hml/challenge2017/

  • Profile Consistency Dataset for Dialogue (zh): https://ai.tencent.com/ailab/nlp/en/dialogue/datasets/KvPI.zip (paper: https://arxiv.org/abs/2009.09680)

  • Grayscale Dataset for Dialogue (zh): https://ai.tencent.com/ailab/nlp/en/dialogue/datasets/grayscale_data_release.zip (paper: https://arxiv.org/abs/2004.02421)

  • Gender-Specific Chat (zh): https://ai.tencent.com/ailab/nlp/en/dialogue/datasets/Stylistic_Dataset.zip (paper: https://arxiv.org/abs/2004.02202)

  • Twitter (en): https://github.com/Marsan-Ma-zz/chat_corpus

  • DailyDialog (en): http://yanran.li/dailydialog.html

  • PersonaChat (en): https://s3.amazonaws.com/datasets.huggingface.co/personachat/personachat_self_original.json

  • OpenSubtitles (en): http://opus.nlpl.eu/OpenSubtitles.php

  • MultiWOZ (en): https://www.repository.cam.ac.uk/handle/1810/294507

  • Cornell (en): https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

  • Topical-Chat (en): https://github.com/alexa/alexa-prize-topical-chat-dataset

  • Switchboard (en): https://github.com/NathanDuran/Switchboard-Corpus

  • Dialogue NLI (en): https://wellecks.github.io/dialogue_nli/

  • Movie Dialog Reddit (en): https://research.fb.com/downloads/babi/

  • Ubuntu Dialogue (en): https://github.com/rkadlec/ubuntu-ranking-dataset-creator

  • EmpatheticDialogues (en): https://github.com/facebookresearch/EmpatheticDialogues

  • Wizard of Wikipedia (en): https://parl.ai/projects/wizard_of_wikipedia/

  • Commonsense Conversation (en): http://coai.cs.tsinghua.edu.cn/file/commonsense_conversation_dataset.tar.gz

  • MuTual (en): https://github.com/Nealcly/MuTual

table-to-text

  • ToTTo Dataset: https://github.com/google-research-datasets/ToTTo

text generation

  • poetry (zh): https://github.com/chinese-poetry/chinese-poetry

  • couplet (zh): https://github.com/wb14123/couplet-dataset

summarization

Multi-Document (MDS)

  • DUC: https://duc.nist.gov/

  • TAC: https://tac.nist.gov/data/

  • RAMDS (cuhk): http://www.se.cuhk.edu.hk/~textmine/dataset/ra-mds/

Single-Document (SDS)

  • LCSTS (zh): http://icrc.hitsz.edu.cn/Article/show/139.html

  • Gigaword (en): https://drive.google.com/file/d/0B6N7tANPyVeBNmlSX19Ld2xDU1E/view

  • CNN/Daily Mail (en): https://github.com/abisee/cnn-dailymail

  • scientific summarization (en): https://github.com/Santosh-Gupta/ScientificSummarizationDataSets

  • Newsroom (en): https://summari.es/download/

  • BigPatent (en): https://drive.google.com/uc?export=download&id=1mwH7eSh1kNci31xduR4Da_XcmTE8B8C3

  • XSum (en): http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Knowledge Base

  • ownthink (zh): https://github.com/ownthink/KnowledgeGraphData
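
Several entries above point straight at an archive or JSON file (for example the XSum tarball or the PersonaChat JSON). A minimal sketch of fetching such a direct link with the Python standard library follows. The helper names `filename_from_url` and `fetch` and the `data` directory are hypothetical, chosen only for illustration. Links that lead to landing pages, GitHub repositories, or Google Drive viewers will not yield the dataset this way and need to be downloaded manually.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse, unquote


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Derive a local file name from the last path segment of a URL.

    Falls back to `default` when the URL path ends in a slash or is empty.
    """
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path)
    return name or default


def fetch(url: str, dest_dir: str = "data") -> str:
    """Download `url` into `dest_dir` unless the file is already there.

    Assumes the URL serves the file directly; returns the local path.
    """
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, filename_from_url(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)
    return target
```

Calling `fetch("http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz")` would store the archive under `data/` if the host still serves it. Availability of any link in this list is not guaranteed.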