Awesome Vision-and-Language
A curated list of awesome vision and language resources, inspired by awesome-computer-vision (still under construction, stay tuned!).
Table of Contents
- Survey
- Dataset
- Image Captioning
- Image Retrieval
- Scene Text Recognition (OCR)
- Scene Graph
- text2image
- Video Captioning
- Video Question Answering
- Video Understanding
- Vision and Language Navigation
- Vision-and-Language Pretraining
- Visual Dialog
- Visual Grounding
- Visual Question Answering (VQA)
- Visual Reasoning
- Visual Relationship Detection
- Visual Storytelling
Survey
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| A Survey of Current Datasets for Vision and Language Research | 2015 EMNLP | 1506.06833 | ||
| Multimodal Machine Learning: A Survey and Taxonomy | 2019 TPAMI | 1705.09406 | | |
| A Comprehensive Survey of Deep Learning for Image Captioning | | 1810.04020 | | |
| Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods | | 1907.09358 | | |
| A Survey of Scene Graph Generation and Application | | Scene-Graph-Survey | | |
| Challenges and Prospects in Vision and Language Research | | 1904.09317 | | |
| Deep Multimodal Representation Learning: A Survey | 2019 ACCESS | ACCESS 2019 | ||
| Multimodal Intelligence: Representation Learning, Information Fusion, and Applications | | 1911.03977 | | |
| Vision and Language: from Visual Perception to Content Creation | 2020 APSIPA | 1912.11872 | | |
| Multimodal Research in Vision and Language: A Review of Current and Emerging Trends | | 2010.09522 | | |
Dataset
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| VQA: Visual Question Answering | 2015 ICCV | 1505.00468 | visualqa | |
| Visual Storytelling | 2016 NAACL | 1604.03968 | ai-visual-storytelling-seq2seq | VIST |
| Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | 2017 IJCV | 1602.07332 | visual_genome_python_driver | visualgenome |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | 2017 CVPR | 1612.06890 | ||
| AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions | 2018 CVPR | 1705.08421 | AVA | |
| Embodied Question Answering | 2018 CVPR | 1711.11543 | embodiedqa | |
| Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments | 2018 CVPR | 1711.07280 | bringmeaspoon | |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | 2019 CVPR | 1902.09506 | visualreasoning | |
| From Recognition to Cognition: Visual Commonsense Reasoning | 2019 CVPR | 1811.10830 | r2c | VCR |
| VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research | 2019 ICCV | 1904.03493 | ||
| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | 2020 NeurIPS | 2010.00763 | Bongard-LOGO | |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | 2022 CVPR | 2205.13803 | Bongard-HOI | |
Image Captioning
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Long-term Recurrent Convolutional Networks for Visual Recognition and Description | 2015 CVPR | 1411.4389 | ||
| Deep Visual-Semantic Alignments for Generating Image Descriptions | 2015 CVPR | 1412.2306 | ||
| Show and Tell: A Neural Image Caption Generator | 2015 CVPR | 1411.4555 | show_and_tell.tensorflow | |
| Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | 2015 ICML | 1502.03044 | show-attend-and-tell | |
| From Captions to Visual Concepts and Back | 2015 CVPR | 1411.4952 | visual-concepts | |
| Image Captioning with Semantic Attention | 2016 CVPR | 1603.03925 | semantic-attention | |
| Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning | 2017 CVPR | 1612.01887 | AdaptiveAttention | |
| Self-critical Sequence Training for Image Captioning | 2017 CVPR | 1612.00563 | ||
| A Hierarchical Approach for Generating Descriptive Image Paragraphs | 2017 CVPR | 1611.06607 | ||
| Deep reinforcement learning-based image captioning with embedding reward | 2017 CVPR | 1704.03899 | ||
| Semantic Compositional Networks for Visual Captioning | 2017 CVPR | 1611.08002 | Semantic_Compositional_Nets | |
| StyleNet: Generating Attractive Visual Captions with Styles | 2017 CVPR | CVPR 2017 | stylenet | |
| Training for Diversity in Image Paragraph Captioning | 2018 EMNLP | EMNLP 2018 | image-paragraph-captioning | |
| Neural Baby Talk | 2018 CVPR | 1803.09845 | NeuralBabyTalk | |
| Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | 2018 CVPR | 1707.07998 | ||
| “Factual” or “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention | 2018 ECCV | 1807.03871 | ||
| Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation | 2019 AAAI | 1805.08191 | ||
| Unsupervised Image Captioning | 2019 CVPR | 1811.10787 | unsupervised_captioning | |
| Context-aware visual policy network for fine-grained image captioning | 2019 TPAMI | 1906.02365 | CAVP | |
| Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning | 2019 CVPR | 1903.05942 | | |
| Describing like Humans: on Diversity in Image Captioning | 2019 CVPR | 1903.12020 | | |
| Good News, Everyone! Context driven entity-aware captioning for news images | 2019 CVPR | 1904.01475 | ||
| Auto-Encoding Scene Graphs for Image Captioning | 2019 CVPR | 1812.02378 | SGAE | |
| MSCap: Multi-Style Image Captioning with Unpaired Stylized Text | 2019 CVPR | CVPR 2019 | ||
| Robust Change Captioning | 2019 ICCV | 1901.02527 | ||
| Attention on Attention for Image Captioning | 2019 ICCV | 1908.06954 | ||
| Context-Aware Group Captioning via Self-Attention and Contrastive Features | 2020 CVPR | 2004.03708 | ||
| Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs | 2020 CVPR | 2003.00387 | asg2cap | |
| Comprehensive Image Captioning via Scene Graph Decomposition | 2020 ECCV | 2007.11731 | Sub-GC | |
| Are scene graphs good enough to improve Image Captioning? | 2020 AACL | 2009.12313 | ||
| SG2Caps: Revisiting Scene Graphs for Image Captioning | 2021 arXiv | 2102.04990 | | |
Image Retrieval
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Visual Word2Vec (vis-w2v) Learning Visually Grounded Word Embeddings Using Abstract Scenes | 2016 CVPR | 1511.07067 | VisualWord2Vec | |
| Composing Text and Image for Image Retrieval - An Empirical Odyssey | 2019 CVPR | 1812.07119 | tirg | |
| Learning Relation Alignment for Calibrated Cross-modal Retrieval | 2021 ACL | 2105.13868 | IAIS | |
| ImageCoDe: Image Retrieval from Contextual Descriptions | 2022 ACL | 2203.15867 | ImageCoDe | |
Scene Text Recognition (OCR)
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Towards Unconstrained End-to-End Text Spotting | 2019 ICCV | 1908.09231 | ||
| What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis | 2019 ICCV | 1904.01906 | clovaai | |
Scene Graph
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Image Retrieval Using Scene Graphs | 2015 CVPR | 7298990 | ||
| Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | 2017 IJCV | 1602.07332 | visual_genome_python_driver | visualgenome |
| Scene Graph Generation by Iterative Message Passing | 2017 CVPR | 1701.02426 | scene-graph-TF-release | |
| Scene Graph Generation from Objects, Phrases and Region Captions | 2017 ICCV | 1707.09700 | MSDN | |
| Neural Motifs: Scene Graph Parsing with Global Context | 2018 CVPR | 1711.06640 | neural-motifs | |
| Generating Triples with Adversarial Networks for Scene Graph Construction | 2018 AAAI | 1802.02598 | ||
| LinkNet: Relational Embedding for Scene Graph | 2018 NeurIPS | 1811.06410 | | |
| Image Generation from Scene Graphs | 2018 CVPR | 1804.01622 | sg2im | |
| Graph R-CNN for Scene Graph Generation | 2018 ECCV | 1808.00191 | graph-rcnn.pytorch | |
| Scene Graph Generation with External Knowledge and Image Reconstruction | 2019 CVPR | 1904.00560 | ||
| Specifying Object Attributes and Relations in Interactive Scene Generation | 2019 ICCV | 1909.05379 | scene_generation | |
| Attentive Relational Networks for Mapping Images to Scene Graphs | 2019 CVPR | 1811.10696 | ||
| Exploring Context and Visual Pattern of Relationship for Scene Graph Generation | 2019 CVPR | | sceneGraph_Mem | |
| Graphical Contrastive Losses for Scene Graph Parsing | 2019 CVPR | 1903.02728 | ContrastiveLosses4VRD | |
| Knowledge-Embedded Routing Network for Scene Graph Generation | 2019 CVPR | 1903.03326 | KERN | |
| Learning to Compose Dynamic Tree Structures for Visual Contexts | 2019 CVPR | 1812.01880 | VCTree | |
| Counterfactual Critic Multi-Agent Training for Scene Graph Generation | 2019 ICCV | 1812.02347 | ||
| Scene Graph Prediction with Limited Labels | 2019 ICCV | 1904.11622 | limited-label | |
| Unbiased Scene Graph Generation from Biased Training | 2020 CVPR | 2002.11949 | Scene-Graph-Benchmark | |
| GPS-Net: Graph Property Sensing Network for Scene Graph Generation | 2020 CVPR | 2003.12962 | GPS-Net | |
| Learning Visual Commonsense for Robust Scene Graph Generation | 2020 ECCV | 2006.09623 | ||
| Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation | 2020 ECCV | 2007.08760 | het-eccv20 | |
text2image
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Generative Adversarial Text to Image Synthesis | 2016 ICML | 1605.05396 | icml2016 | |
| StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks | 2017 ICCV | 1612.03242 | StackGAN | |
| AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks | 2018 CVPR | 1711.10485 | AttnGAN | |
| Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network | 2018 CVPR | 1802.09178 | HDGan | |
| StoryGAN: A Sequential Conditional GAN for Story Visualization | 2019 CVPR | 1812.02784 | StoryGAN | |
| MirrorGAN: Learning Text-to-image Generation by Redescription | 2019 CVPR | 1903.05854 | ||
| DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis | 2019 CVPR | 1904.01310 | ||
| Semantics Disentangling for Text-to-Image Generation | 2019 CVPR | 1904.01480 | ||
| Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction | 2019 ICCV | 1811.09845 | GeNeVA | |
| Specifying Object Attributes and Relations in Interactive Scene Generation | 2019 ICCV | 1909.05379 | scene_generation | |
Video Captioning
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Long-term Recurrent Convolutional Networks for Visual Recognition and Description | 2015 CVPR | 1411.4389 | ||
| Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks | 2016 CVPR | 1510.07712 | ||
| Attention-Based Multimodal Fusion for Video Description | 2017 CVPR | 1701.03126 | ||
| Semantic Compositional Networks for Visual Captioning | 2017 CVPR | 1611.08002 | | |
| Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description | 2017 CVPR | CVPR_2017 | ||
| Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | 2018 CVPR | 1804.00100 | ||
| Adversarial Inference for Multi-Sentence Video Description | 2019 CVPR | 1812.05634 | adv-inf | |
| Streamlined Dense Video Captioning | 2019 CVPR | 1904.03870 | DenseVideoCaptioning | |
| Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning | 2019 CVPR | 1906.04375 | ||
| iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | 2021 WACV | 2011.07735 | iPerceive | |
Video Question Answering
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| MovieQA: Understanding Stories in Movies through Question-Answering | 2016 CVPR | 1512.02902 | MovieQA | |
| TVQA: Localized, Compositional Video Question Answering | 2018 EMNLP | 1809.01696 | TVQA | |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | 2020 ECCV | 2007.08751 | ROLL-VideoQA | |
| iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | 2021 WACV | 2011.07735 | iPerceive | |
Video Understanding
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| TSM: Temporal Shift Module for Efficient Video Understanding | 2019 ICCV | 1811.08383 | temporal-shift-module | |
| A Graph-Based Framework to Bridge Movies and Synopses | 2019 ICCV | 1910.11009 | ||
Vision and Language Navigation
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Embodied Question Answering | 2018 CVPR | 1711.11543 | embodiedqa | |
| Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments | 2018 CVPR | 1711.07280 | bringmeaspoon | |
Vision-and-Language Pretraining
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | 2019 EMNLP | 1908.07490 | lxmert | |
| VideoBERT: A Joint Model for Video and Language Representation Learning | 2019 ICCV | 1904.01766 | ||
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | 2019 NeurIPS | 1908.02265 | vilbert | |
| OmniNet: A unified architecture for multi-modal multi-task learning | 2019 arXiv | 1907.07804 | OmniNet | |
| Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | 2020 AAAI | 1908.06066 | Unicoder | |
| Unified Vision-Language Pre-Training for Image Captioning and VQA | 2020 AAAI | 1909.11059 | VLP | |
| Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | 2020 ECCV | 2004.06165 | Oscar | |
| Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | 2020 NeurIPS | 2006.09882 | swav | |
| Learning to Learn Words from Visual Scenes | 2020 ECCV | 1911.11237 | | |
| ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs | 2021 AAAI | 2006.16934 | ERNIE | |
| VinVL: Revisiting Visual Representations in Vision-Language Models | 2021 CVPR | 2101.00529 | VinVL | |
| VirTex: Learning Visual Representations from Textual Annotations | 2021 CVPR | 2006.06666 | virtex | |
| Learning Transferable Visual Models From Natural Language Supervision | 2021 arXiv | 2103.00020 | | |
| Pretrained Transformers As Universal Computation Engines | 2021 arXiv | 2103.05247 | universal-computation | |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | 2021 arXiv | 2102.05918 | | |
| Self-supervised Pretraining of Visual Features in the Wild | 2021 arXiv | 2103.01988 | | |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | 2021 arXiv | 2102.10772 | | |
| Zero-Shot Text-to-Image Generation | 2021 arXiv | 2102.12092 | | |
| WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training | 2021 arXiv | 2103.06561 | | |
Visual Dialog
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Visual Dialog | 2017 CVPR | 1611.08669 | visdial | visualdialog |
| Two Can Play This Game: Visual Dialog With Discriminative Question Generation and Answering | 2018 CVPR | 1803.11186 | ||
Visual Grounding
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Modeling Relationships in Referential Expressions with Compositional Modular Networks | 2017 CVPR | 1611.09978 | cmn | |
| Phrase Localization Without Paired Training Examples | 2019 ICCV | 1908.07553 | ||
| Learning to Assemble Neural Module Tree Networks for Visual Grounding | 2019 ICCV | 1812.03299 | ||
| A Fast and Accurate One-Stage Approach to Visual Grounding | 2019 ICCV | 1908.06354 | ||
| Zero-Shot Grounding of Objects from Natural Language Queries | 2019 ICCV | 1908.07129 | zsgnet | |
Visual Question Answering (VQA)
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| VQA: Visual Question Answering | 2015 ICCV | 1505.00468 | visualqa | |
| Hierarchical question-image co-attention for visual question answering | 2016 NIPS | 1606.00061 | HieCoAttenVQA | |
| Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | 2016 EMNLP | 1606.01847 | vqa-mcb | |
| Stacked Attention Networks for Image Question Answering | 2016 CVPR | 1511.02274 | imageqa-san | |
| Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering | 2016 ECCV | 1511.05234 | AAAA | |
| Dynamic Memory Networks for Visual and Textual Question Answering | 2016 ICML | 1603.01417 | dmn-plus | |
| Multimodal Residual Learning for Visual QA | 2016 NIPS | 1606.01455 | nips-mrn-vqa | |
| Graph-Structured Representations for Visual Question Answering | 2017 CVPR | 1609.05600 | ||
| Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | 2017 CVPR | 1612.00837 | | |
| Learning to Reason: End-to-End Module Networks for Visual Question Answering | 2017 ICCV | 1704.05526 | ||
| Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering | 2018 AAAI | 1803.08896 | PSLQA | |
| Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | 2018 CVPR | 1707.07998 | ||
| Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | 2018 CVPR | 1708.02711 | vqa-winner | |
| Transfer Learning via Unsupervised Task Discovery for Visual Question Answering | 2019 CVPR | 1810.02358 | VQA-Transfer-ExternalData | |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | 2019 CVPR | 1902.09506 | visualreasoning | |
| Towards VQA Models That Can Read | 2019 CVPR | 1904.08920 | ||
| From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason | 2019 ICCV | ICCV2019 | ||
| An Empirical Study on Leveraging Scene Graphs for Visual Question Answering | 2019 BMVC | 1907.12133 | scene-graphs-vqa | |
| RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning | 2022 ICLR | 2204.11167 | RelViT | |
| TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation | 2022 arXiv | 2208.01813 | TAG | |
Visual Reasoning
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | 2017 CVPR | 1612.06890 | ||
| Inferring and Executing Programs for Visual Reasoning | 2017 ICCV | 1705.03633 | ||
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | 2019 CVPR | 1902.09506 | visualreasoning | |
| Explainable and Explicit Visual Reasoning over Scene Graphs | 2019 CVPR | 1812.01855 | ||
| From Recognition to Cognition: Visual Commonsense Reasoning | 2019 CVPR | 1811.10830 | r2c | VCR |
| Dynamic Graph Attention for Referring Expression Comprehension | 2019 ICCV | 1909.08164 | ||
| Visual Semantic Reasoning for Image-Text Matching | 2019 ICCV | 1909.02701 | VSRN | |
| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | 2020 NeurIPS | 2010.00763 | Bongard-LOGO | |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | 2022 CVPR | 2205.13803 | Bongard-HOI | |
| RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning | 2022 ICLR | 2204.11167 | RelViT | |
Visual Relationship Detection
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Visual Relationship Detection with Language Priors | 2016 ECCV | 1608.00187 | Visual-Relationship-Detection | |
| ViP-CNN: Visual Phrase Guided Convolutional Neural Network | 2017 CVPR | 1702.07191 | ||
| Visual Translation Embedding Network for Visual Relation Detection | 2017 CVPR | 1702.08319 | | |
| Deep Variation-structured Reinforcement Learning for Visual Relationship and Attribute Detection | 2017 CVPR | 1703.03054 | DeepVariationRL | |
| Detecting Visual Relationships with Deep Relational Networks | 2017 CVPR | 1704.03114 | drnet | |
| Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues | 2017 ICCV | 1611.06641 | pl-clc | |
| Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation | 2017 ICCV | 1707.09423 | ||
| Referring Relationships | 2018 CVPR | 1803.10362 | ReferringRelationships | |
| Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition | 2018 ECCV | 1807.04979 | ZoomNet | |
| Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features | 2018 ECCV | 1808.00171 | vrd | |
| Leveraging Auxiliary Text for Deep Recognition of Unseen Visual Relationships | 2020 ICLR | 1910.12324 | ||
Visual Storytelling
| Title | Conference / Journal | Paper | Code | Remarks |
|---|---|---|---|---|
| Visual Storytelling | 2016 NAACL | 1604.03968 | ai-visual-storytelling-seq2seq | VIST |
| No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling | 2018 ACL | 1804.09160 | AREL | |
| Show, Reward and Tell: Automatic Generation of Narrative Paragraph from Photo Stream by Adversarial Training | 2018 AAAI | |||
| Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling | 2020 AAAI | 2002.00774 | ||
| Storytelling from an Image Stream Using Scene Graphs | 2020 AAAI | AAAI 2020 | ||
Contributing
Please feel free to send a pull request or email me ([email protected]) to add links.
License
To the extent possible under law, Sangmin Woo has waived all copyright and related or neighboring rights to this work.
