Multimodal datasets
This repository is built in association with our position paper, "Multimodality for NLP-Centered Applications: Resources, Advances and Frontiers".
As part of this release, we share information about recent multimodal datasets that are available for research purposes.
We found that although more than 100 multimodal language resources are available in the literature for various NLP tasks, publicly available multimodal datasets remain under-explored for reuse in subsequent problem domains.
Multimodal datasets for NLP Applications
- Sentiment Analysis
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
| --- | --- | --- | --- |
| EmoDB | A Database of German Emotional Speech | Paper | Dataset |
| VAM | The Vera am Mittag German Audio-Visual Emotional Speech Database | Paper | Dataset |
| IEMOCAP | IEMOCAP: interactive emotional dyadic motion capture database | Paper | Dataset |
| Mimicry | A Multimodal Database for Mimicry Analysis | Paper | Dataset |
| YouTube | Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web | Paper | Dataset |
| HUMAINE | The HUMAINE database | Paper | Dataset |
| Large Movies | Sentiment classification on Large Movie Review | Paper | Dataset |
| SEMAINE | The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent | Paper | Dataset |
| AFEW | Collecting Large, Richly Annotated Facial-Expression Databases from Movies | Paper | Dataset |
| SST | Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank | Paper | Dataset |
| ICT-MMMO | YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context | Paper | Dataset |
| RECOLA | Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions | Paper | Dataset |
| MOUD | Utterance-Level Multimodal Sentiment Analysis | Paper | - |
| CMU-MOSI | MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos | Paper | Dataset |
| POM | Multimodal Analysis and Prediction of Persuasiveness in Online Social Multimedia | Paper | Dataset |
| MELD | MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations | Paper | Dataset |
| CMU-MOSEI | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | Paper | Dataset |
| AMMER | Towards Multimodal Emotion Recognition in German Speech Events in Cars using Transfer Learning | Paper | On Request |
| SEWA | SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild | Paper | Dataset |
| Fakeddit | r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection | Paper | Dataset |
| CMU-MOSEAS | CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French | Paper | Dataset |
| MultiOFF | Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text | Paper | Dataset |
| MEISD | MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations | Paper | Dataset |
| TASS | Overview of TASS 2020: Introducing Emotion | Paper | Dataset |
| CH-SIMS | CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotations of Modality | Paper | Dataset |
| Creep-Image | A Multimodal Dataset of Images and Text | Paper | Dataset |
| Entheos | Entheos: A Multimodal Dataset for Studying Enthusiasm | Paper | Dataset |
- Machine Translation
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
| --- | --- | --- | --- |
| Multi30K | Multi30K: Multilingual English-German Image Descriptions | Paper | Dataset |
| How2 | How2: A Large-scale Dataset for Multimodal Language Understanding | Paper | Dataset |
| MLT | Multimodal Lexical Translation | Paper | Dataset |
| IKEA | A Visual Attention Grounding Neural Model for Multimodal Machine Translation | Paper | Dataset |
| Flickr30K (EN- (hi-IN)) | Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data | Paper | On Request |
| Hindi Visual Genome | Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation | Paper | Dataset |
| HowTo100M | Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models | Paper | Dataset |
- Information Retrieval
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
| --- | --- | --- | --- |
| MUSICLEF | MusiCLEF: a Benchmark Activity in Multimodal Music Information Retrieval | Paper | Dataset |
| Moodo | The Moodo dataset: Integrating user context with emotional and color perception of music for affective music information retrieval | Paper | Dataset |
| ALF-200k | ALF-200k: Towards Extensive Multimodal Analyses of Music Tracks and Playlists | Paper | Dataset |
| MQA | Can Image Captioning Help Passage Retrieval in Multimodal Question Answering? | Paper | Dataset |
| WAT2019 | WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset | Paper | Dataset |
| ViTT | Multimodal Pretraining for Dense Video Captioning | Paper | Dataset |
| MTD | MTD: A Multimodal Dataset of Musical Themes for MIR Research | Paper | Dataset |
| MusiClef | A professionally annotated and enriched multimodal data set on popular music | Paper | Dataset |
| Schubert Winterreise | Schubert Winterreise dataset: A multimodal scenario for music analysis | Paper | Dataset |
| WIT | WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Paper | Dataset |
- Question Answering
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
| --- | --- | --- | --- |
| MQA | A Dataset for Multimodal Question Answering in the Cultural Heritage Domain | Paper | - |
| MovieQA | MovieQA: Understanding stories in movies through question-answering | Paper | Dataset |
| PororoQA | DeepStory: Video Story QA by Deep Embedded Memory Networks | Paper | Dataset |
| MemexQA | MemexQA: Visual Memex Question Answering | Paper | Dataset |
| VQA | Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering | Paper | Dataset |
| TDIUC | An analysis of visual question answering algorithms | Paper | Dataset |
| TGIF-QA | TGIF-QA: Toward spatio-temporal reasoning in visual question answering | Paper | Dataset |
| MSVD-QA, MSRVTT-QA | Video question answering via attribute augmented attention network learning | Paper | Dataset |
| YouTube2Text | Video Question Answering via Gradually Refined Attention over Appearance and Motion | Paper | Dataset |
| MovieFIB | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | Paper | Dataset |
| Video Context QA | Uncovering the temporal context for video question answering | Paper | Dataset |
| MarioQA | MarioQA: Answering questions by watching gameplay videos | Paper | Dataset |
| TVQA | TVQA: Localized, compositional video question answering | Paper | Dataset |
| VQA-CP v2 | Don’t just assume; look and answer: Overcoming priors for visual question answering | Paper | Dataset |
| RecipeQA | RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes | Paper | Dataset |
| GQA | GQA: A new dataset for real-world visual reasoning and compositional question answering | Paper | Dataset |
| Social-IQ | Social-IQ: A question answering benchmark for artificial social intelligence | Paper | Dataset |
| MIMOQA | MIMOQA: Multimodal Input Multimodal Output Question Answering | Paper | - |
- Summarization
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
| --- | --- | --- | --- |
| SumMe | Creating summaries from user videos | Paper | Dataset |
| TVSum | TVSum: Summarizing web videos using titles | Paper | Dataset |
| QFVS | Query-focused video summarization: Dataset, evaluation, and a memory network based approach | Paper | Dataset |
| MMSS | Multi-modal Sentence Summarization with Modality Attention and Image Filtering | Paper | - |
| MSMO | MSMO: Multimodal Summarization with Multimodal Output | Paper | - |
| Screen2Words | Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning | Paper | Dataset |
| AVIATE | See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization | Paper | Dataset |
| Multimodal Microblog Summarization | On Multimodal Microblog Summarization | Paper | - |
- Human Computer Interaction
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
| --- | --- | --- | --- |
| CUAVE | CUAVE: A new audio-visual database for multimodal human-computer interface research | Paper | Dataset |
| MHAD | Berkeley MHAD: A comprehensive multimodal human action database | Paper | Dataset |
| Multi-party interactions | A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction | Paper | - |
| MHHRI | Multimodal Human-Human-Robot Interactions (MHHRI) Dataset for Studying Personality and Engagement | Paper | [Dataset](https://www.cl.cam.ac.uk/research/rainbow/projects/mhhri/) |
| Red Hen Lab | Red Hen Lab: Dataset and Tools for Multimodal Human Communication Research | Paper | - |
| EMRE | Generating a Novel Dataset of Multimodal Referring Expressions | Paper | Dataset |
| Chinese Whispers | Chinese whispers: A multimodal dataset for embodied language grounding | Paper | Dataset |
| uulmMAC | The uulmMAC database—A multimodal affective corpus for affective computing in human-computer interaction | Paper | Dataset |
- Semantic Analysis
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
| --- | --- | --- | --- |
| WN9-IMG | Image-embodied Knowledge Representation Learning | Paper | Dataset |
| Wikimedia Commons | A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions | Paper | Dataset |
| Starsem18-multimodalKB | A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning | Paper | [Dataset](https://github.com/UKPLab/starsem18-multimodalKB) |
| MUStARD | Towards Multimodal Sarcasm Detection | Paper | Dataset |
| YouMakeup | YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension | Paper | [Dataset](https://github.com/AIM3-RUC/YouMakeup) |
| MDID | Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts | Paper | [Dataset](https://github.com/karansikka1/documentIntent_emnlp19) |
| Social media posts from Flickr (Mental Health) | Inferring Social Media Users’ Mental Health Status from Multimodal Information | Paper | Dataset |
| Twitter MEL | Building a Multimodal Entity Linking Dataset From Tweets | Paper | [Dataset](https://github.com/OA256864/MEL_Tweets) |
| MultiMET | MultiMET: A Multimodal Dataset for Metaphor Understanding | Paper | - |
| MSDS | Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline | Paper | Dataset |
- Miscellaneous
| Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
| --- | --- | --- | --- |
| MS COCO | Microsoft COCO: Common objects in context | Paper | Dataset |
| ILSVRC | ImageNet Large Scale Visual Recognition Challenge | Paper | Dataset |
| YFCC100M | YFCC100M: The new data in multimedia research | Paper | Dataset |
| COGNIMUSE | COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization | Paper | Dataset |
| SNAG | SNAG: Spoken Narratives and Gaze Dataset | Paper | Dataset |
| UR-FUNNY | UR-FUNNY: A Multimodal Language Dataset for Understanding Humor | Paper | Dataset |
| Bag-of-Lies | Bag-of-Lies: A Multimodal Dataset for Deception Detection | Paper | Dataset |
| MARC | A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks | Paper | Dataset |
| MuSE | MuSE: a Multimodal Dataset of Stressed Emotion | Paper | Dataset |
| BabelPic | Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts | Paper | Dataset |
| Eye4Ref | Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations | Paper | - |
| Troll Memes | A Dataset for Troll Classification of TamilMemes | Paper | Dataset |
| SEMD | EmoSen: Generating sentiment and emotion controlled responses in a multimodal dialogue system | Paper | - |
| Chat talk Corpus | Construction and Analysis of a Multimodal Chat-talk Corpus for Dialog Systems Considering Interpersonal Closeness | Paper | - |
| EMOTyDA | Towards Emotion-aided Multi-modal Dialogue Act Classification | Paper | Dataset |
| MELINDA | MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification | Paper | Dataset |
| NewsCLIPpings | NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media | Paper | Dataset |
| R2VQ | Designing Multimodal Datasets for NLP Challenges | Paper | Dataset |
| M2H2 | M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations | Paper | Dataset |