Awesome Image Captioning
A curated list of resources on image captioning and related areas. :-)
Contributing
Please feel free to send me pull requests or email ([email protected]) to add links. Markdown format:
- [Paper Name](link) - Author 1 et al, `Conference Year`. [[code]](link)
Change Log
- May 25: An up-to-date paper list on vision-and-language pre-training is available here.
Table of Contents
- Papers
  - Survey
  - Before
  - 2015
  - 2016
  - 2017
  - 2018
  - 2019
  - 2020
  - Dataset
  - Image Captioning Challenge
- Popular Implementations
  - PyTorch
  - TensorFlow
  - Torch
  - Others
Papers
Survey
- A Comprehensive Survey of Deep Learning for Image Captioning - Hossain M et al, arXiv preprint 2018.
Before
- I2t: Image parsing to text description - Yao B Z et al, P IEEE 2011.
- Im2Text: Describing Images Using 1 Million Captioned Photographs - Ordonez V et al, NIPS 2011. [project web]
- Deep Captioning with Multimodal Recurrent Neural Networks - Mao J et al, arXiv preprint 2014.
2015
CVPR 2015
- Show and Tell: A Neural Image Caption Generator - Vinyals O et al, CVPR 2015. [code] [code]
- Deep Visual-Semantic Alignments for Generating Image Descriptions - Karpathy A et al, CVPR 2015. [project web] [code]
- Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation - Chen X et al, CVPR 2015.
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description - Donahue J et al, CVPR 2015. [code] [project web]
ICCV 2015
- Guiding the Long-Short Term Memory Model for Image Caption Generation - Jia X et al, ICCV 2015.
- Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images - Mao J et al, ICCV 2015. [code]
NIPS 2015
- Expressing an Image Stream with a Sequence of Natural Sentences - Park C C et al, NIPS 2015. [code]
ICML 2015
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention - Xu K et al, ICML 2015. [project] [code] [code]
arXiv preprint 2015
- Order-Embeddings of Images and Language - Vendrov I et al, arXiv preprint 2015. [code]
- Generating Images from Captions with Attention - Mansimov E et al, arXiv preprint 2015. [code]
- Learning FRAME Models Using CNN Filters for Knowledge Visualization - Lu Y et al, arXiv preprint 2015. [code]
- Aligning where to see and what to tell: image caption with region-based attention and scene factorization - Jin J et al, arXiv preprint 2015.
2016
CVPR 2016
- Image captioning with semantic attention - You Q et al, CVPR 2016. [code]
- DenseCap: Fully Convolutional Localization Networks for Dense Captioning - Johnson J et al, CVPR 2016. [code]
- What value do explicit high level concepts have in vision to language problems? - Wu Q et al, CVPR 2016.
- Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data - Hendricks L A et al, CVPR 2016. [code]
ECCV 2016
- SPICE: Semantic Propositional Image Caption Evaluation - Anderson P et al, ECCV 2016. [code]
ACMMM 2016
- Image Captioning with Deep Bidirectional LSTMs - Wang C et al, ACMMM 2016. [code]
ACL 2016
- Multimodal Pivots for Image Caption Translation - Hitschler J et al, ACL 2016.
arXiv preprint 2016
- Image Caption Generation with Text-Conditional Semantic Attention - Zhou L et al, arXiv preprint 2016. [code]
- DeepDiary: Automatic Caption Generation for Lifelogging Image Streams - Fan C et al, arXiv preprint 2016.
- Learning to generalize to new compositions in image understanding - Atzmon Y et al, arXiv preprint 2016.
- Generating captions without looking beyond objects - Heuer H et al, arXiv preprint 2016.
- Bootstrap, Review, Decode: Using Out-of-Domain Textual Data to Improve Image Captioning - Chen W et al, arXiv preprint 2016. [code]
- Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering - Liu H et al, arXiv preprint 2016.
- Recurrent Highway Networks with Language CNN for Image Captioning - Gu J et al, arXiv preprint 2016.
2017
CVPR 2017
- Captioning Images with Diverse Objects - Venugopalan S et al, CVPR 2017. [code]
- Top-down Visual Saliency Guided by Captions - Ramanishka V et al, CVPR 2017. [code]
- Self-Critical Sequence Training for Image Captioning - Rennie S J et al, CVPR 2017. [code]
- Dense Captioning with Joint Inference and Visual Context - Yang L et al, CVPR 2017. [code]
- Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition - Wang Y et al, CVPR 2017. [code]
- A Hierarchical Approach for Generating Descriptive Image Paragraphs - Krause J et al, CVPR 2017. [code]
- Deep Reinforcement Learning-based Image Captioning with Embedding Reward - Ren Z et al, CVPR 2017.
- Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects - Yao T et al, CVPR 2017.
- Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning - Lu J et al, CVPR 2017. [code]
- Attend to You: Personalized Image Captioning with Context Sequence Memory Networks - Park C C et al, CVPR 2017. [code]
- SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning - Chen L et al, CVPR 2017. [code]
- Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-In-The-Blank Image Captioning - Sun Q et al, CVPR 2017.
ICCV 2017
- Areas of Attention for Image Captioning - Pedersoli M et al, ICCV 2017.
- Boosting Image Captioning with Attributes - Yao T et al, ICCV 2017.
- An Empirical Study of Language CNN for Image Captioning - Gu J et al, ICCV 2017.
- Improved Image Captioning via Policy Gradient Optimization of SPIDEr - Liu S et al, ICCV 2017.
- Towards Diverse and Natural Image Descriptions via a Conditional GAN - Dai B et al, ICCV 2017. [code]
- Paying Attention to Descriptions Generated by Image Captioning Models - Tavakoli H R et al, ICCV 2017.
- Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner - Chen T H et al, ICCV 2017. [code]
AAAI 2017
- Image Caption with Global-Local Attention - Li L et al, AAAI 2017.
- Reference Based LSTM for Image Captioning - Chen M et al, AAAI 2017.
- Attention Correctness in Neural Image Captioning - Liu C et al, AAAI 2017.
- Text-guided Attention Model for Image Captioning - Mun J et al, AAAI 2017. [code]
NIPS 2017
- Contrastive Learning for Image Captioning - Dai B et al, NIPS 2017. [code]
TPAMI 2017
- Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge - Vinyals O et al, TPAMI 2017. [code]
arXiv preprint 2017
- MAT: A Multimodal Attentive Translator for Image Captioning - Liu C et al, arXiv preprint 2017.
- Actor-Critic Sequence Training for Image Captioning - Zhang L et al, arXiv preprint 2017.
- What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? - Tanti M et al, arXiv preprint 2017. [code]
- Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning - Xian Y et al, arXiv preprint 2017.
- Phrase-based Image Captioning with Hierarchical LSTM Model - Tan Y H et al, arXiv preprint 2017.
- Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning - Chen H et al, arXiv preprint 2017.
2018
CVPR 2018
- Neural Baby Talk - Lu J et al, CVPR 2018. [code]
- Convolutional Image Captioning - Aneja J et al, CVPR 2018.
- Learning to Evaluate Image Captioning - Cui Y et al, CVPR 2018. [code]
- Discriminability Objective for Training Descriptive Captions - Luo R et al, CVPR 2018. [code]
- SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text - Mathews A et al, CVPR 2018.
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering - Anderson P et al, CVPR 2018. [code]
- GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints - Chen F et al, CVPR 2018.
ECCV 2018
- Unpaired Image Captioning by Language Pivoting - Gu J et al, ECCV 2018.
- Recurrent Fusion Network for Image Captioning - Jiang W et al, ECCV 2018.
- Exploring Visual Relationship for Image Captioning - Yao T et al, ECCV 2018.
- Rethinking the Form of Latent States in Image Captioning - Dai B et al, ECCV 2018. [code]
- Boosted Attention: Leveraging Human Attention for Image Captioning - Chen S et al, ECCV 2018.
- "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention - Chen T et al, ECCV 2018.
AAAI 2018
- Learning to Guide Decoding for Image Captioning - Jiang W et al, AAAI 2018.
- Stack-Captioning: Coarse-to-Fine Learning for Image Captioning - Gu J et al, AAAI 2018. [code]
- Temporal-difference Learning with Sampling Baseline for Image Captioning - Chen H et al, AAAI 2018.
NeurIPS 2018
- Partially-Supervised Image Captioning - Anderson P et al, NeurIPS 2018.
- A Neural Compositional Paradigm for Image Captioning - Dai B et al, NeurIPS 2018.
NAACL 2018
- Defoiling Foiled Image Captions - Madhyastha P et al, NAACL 2018.
- Punny Captions: Witty Wordplay in Image Descriptions - Chandrasekaran A et al, NAACL 2018. [code]
- Object Counts! Bringing Explicit Detections Back into Image Captioning - Wang J et al, NAACL 2018.
ACL 2018
- Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning - Sharma P et al, ACL 2018. [code]
- Attacking visual language grounding with adversarial examples: A case study on neural image captioning - Chen H et al, ACL 2018. [code]
EMNLP 2018
- simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions - Liu et al, EMNLP 2018. [code]
arXiv preprint 2018
- Improved Image Captioning with Adversarial Semantic Alignment - Melnyk I et al, arXiv preprint 2018.
- Improving Image Captioning with Conditional Generative Adversarial Nets - Chen C et al, arXiv preprint 2018.
- CNN+CNN: Convolutional Decoders for Image Captioning - Wang Q et al, arXiv preprint 2018.
- Diverse and Controllable Image Captioning with Part-of-Speech Guidance - Deshpande A et al, arXiv preprint 2018.
2019
CVPR 2019
- Unsupervised Image Captioning - Yang F et al, CVPR 2019. [code]
- Engaging Image Captioning Via Personality - Shuster K et al, CVPR 2019.
- Pointing Novel Objects in Image Captioning - Li Y et al, CVPR 2019.
- Auto-Encoding Scene Graphs for Image Captioning - Yang X et al, CVPR 2019.
- Context and Attribute Grounded Dense Captioning - Yin G et al, CVPR 2019.
- Look Back and Predict Forward in Image Captioning - Qin Y et al, CVPR 2019.
- Self-critical n-step Training for Image Captioning - Gao J et al, CVPR 2019.
- Intention Oriented Image Captions with Guiding Objects - Zheng Y et al, CVPR 2019.
- Describing like humans: on diversity in image captioning - Wang Q et al, CVPR 2019.
- Adversarial Semantic Alignment for Improved Image Captions - Dognin P et al, CVPR 2019.
- MSCap: Multi-Style Image Captioning With Unpaired Stylized Text - Gao L et al, CVPR 2019.
- Fast, Diverse and Accurate Image Captioning Guided By Part-of-Speech - Deshpande A et al, CVPR 2019.
- Good News, Everyone! Context driven entity-aware captioning for news images - Biten A F et al, CVPR 2019. [code]
- CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection - Zhang L et al, CVPR 2019. [code]
- Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning - Kim D et al, CVPR 2019. [code]
- Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions - Cornia M et al, CVPR 2019. [code]
- Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables - Xu Y et al, CVPR 2019.
AAAI 2019
- Meta Learning for Image Captioning - Li N et al, AAAI 2019.
- Learning Object Context for Dense Captioning - Li X et al, AAAI 2019.
- Hierarchical Attention Network for Image Captioning - Wang W et al, AAAI 2019.
- Deliberate Residual based Attention Network for Image Captioning - Gao L et al, AAAI 2019.
- Improving Image Captioning with Conditional Generative Adversarial Nets - Chen C et al, AAAI 2019.
- Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding - Song L et al, AAAI 2019.
ACL 2019
- Dense Procedure Captioning in Narrated Instructional Videos - Shi B et al, ACL 2019.
- Informative Image Captioning with External Sources of Information - Zhao S et al, ACL 2019.
- Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning - Fan Z et al, ACL 2019.
BMVC 2019
- Image Captioning with Unseen Objects - Demirel et al, BMVC 2019.
- Look and Modify: Modification Networks for Image Captioning - Sammani et al, BMVC 2019. [code]
- Show, Infer and Tell: Contextual Inference for Creative Captioning - Khare et al, BMVC 2019. [code]
- SC-RANK: Improving Convolutional Image Captioning with Self-Critical Learning and Ranking Metric-based Reward - Yan et al, BMVC 2019.
ICCV 2019
- Hierarchy Parsing for Image Captioning - Yao T et al, ICCV 2019.
- Entangled Transformer for Image Captioning - Li G et al, ICCV 2019.
- Attention on Attention for Image Captioning - Huang L et al, ICCV 2019. [code]
- Reflective Decoding Network for Image Captioning - Ke L et al, ICCV 2019.
- Learning to Collocate Neural Modules for Image Captioning - Yang X et al, ICCV 2019.
NeurIPS 2019
- Image Captioning: Transforming Objects into Words - Herdade S et al, NeurIPS 2019.
- Adaptively Aligned Image Captioning via Adaptive Attention Time - Huang L et al, NeurIPS 2019. [code]
- Variational Structured Semantic Inference for Diverse Image Captioning - Chen F et al, NeurIPS 2019.
- Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations - Liu F et al, NeurIPS 2019. [code]
IJCAI 2019
- Image Captioning with Compositional Neural Module Networks - Tian J et al, IJCAI 2019.
- Exploring and Distilling Cross-Modal Information for Image Captioning - Liu F et al, IJCAI 2019.
- Swell-and-Shrink: Decomposing Image Captioning by Transformation and Summarization - Wang H et al, IJCAI 2019.
- Hornet: a hierarchical offshoot recurrent network for improving person re-ID via image captioning - Yan S et al, IJCAI 2019.
EMNLP 2019
- Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach - Kim D J et al, EMNLP 2019.
- TIGEr: Text-to-Image Grounding for Image Caption Evaluation - Jiang M et al, EMNLP 2019.
- REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning - Jiang M et al, EMNLP 2019.
- Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering - Changpinyo S et al, EMNLP 2019.
CoNLL 2019
- Compositional Generalization in Image Captioning - Nikolaus M et al, CoNLL 2019. [code]
2020
AAAI 2020
- MemCap: Memorizing Style Knowledge for Image Captioning - Zhao et al, AAAI 2020.
- Unified Vision-Language Pre-Training for Image Captioning and VQA - Zhou L et al, AAAI 2020.
- Show, Recall, and Tell: Image Captioning with Recall Mechanism - Wang L et al, AAAI 2020.
- Reinforcing an Image Caption Generator using Off-line Human Feedback - Seo P H et al, AAAI 2020.
- Interactive Dual Generative Adversarial Networks for Image Captioning - Liu et al, AAAI 2020.
- Feature Deformation Meta-Networks in Image Captioning of Novel Objects - Cao et al, AAAI 2020.
- Joint Commonsense and Relation Reasoning for Image and Video Captioning - Hou et al, AAAI 2020.
- Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption - Zhang et al, AAAI 2020.
CVPR 2020
- Normalized and Geometry-Aware Self-Attention Network for Image Captioning - Guo L et al, CVPR 2020.
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning - Zhang Z et al, CVPR 2020.
- Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs - Chen S et al, CVPR 2020.
- X-Linear Attention Networks for Image Captioning - Pan et al, CVPR 2020.
ACL 2020
- Improving Image Captioning with Better Use of Caption - Shi Z et al, ACL 2020.
- Cross-modal Coherence Modeling for Caption Generation - Alikhani M et al, ACL 2020.
- Improving Image Captioning Evaluation by Considering Inter References Variance - Yi Y et al, ACL 2020.
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning - Lei J et al, ACL 2020.
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA - Kim H et al, ACL 2020.
ECCV 2020
- Length-Controllable Image Captioning - Deng C et al, ECCV 2020.
- Captioning Images Taken by People Who Are Blind - Gurari D et al, ECCV 2020.
- Towards Unique and Informative Captioning of Images - Wang Z et al, ECCV 2020.
- Learning Visual Representations with Caption Annotations - Sariyildiz M et al, ECCV 2020.
- Comprehensive Image Captioning via Scene Graph Decomposition - Zhong Y et al, ECCV 2020.
- SODA: Story Oriented Dense Video Captioning Evaluation Framework - Fujita S et al, ECCV 2020.
- TextCaps: a Dataset for Image Captioning with Reading Comprehension - Sidorov O et al, ECCV 2020.
- Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets - Wang J et al, ECCV 2020.
- Learning to Generate Grounded Visual Captions without Localization Supervision - Ma C et al, ECCV 2020.
- Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards - Yang X et al, ECCV 2020.
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos - Chen S et al, ECCV 2020.
EMNLP 2020
- CapWAP: Image Captioning with a Purpose - Fisch A et al, EMNLP 2020.
- X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers - Cho J et al, EMNLP 2020.
- Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning - Fang Z et al, EMNLP 2020.
- Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements - Li Y et al, EMNLP 2020.
NeurIPS 2020
- Diverse Image Captioning with Context-Object Split Latent Spaces - Mahajan S et al, NeurIPS 2020.
- RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning - Chiaro R et al, NeurIPS 2020.
Dataset
- nocaps, LANG: English.
- MS COCO, LANG: English.
- Flickr 8k, LANG: English.
- Flickr 30k, LANG: English.
- AI Challenger, LANG: Chinese.
- Visual Genome, LANG: English.
- SBUCaptionedPhotoDataset, LANG: English.
- IAPR TC-12, LANG: English, German and Spanish.
Image Captioning Challenge
Popular Implementations
PyTorch
TensorFlow
Torch
Others
Licenses
To the extent possible under law, Zhihong Chen has waived all copyright and related or neighboring rights to this work.