awesome-visual-grounding icon indicating copy to clipboard operation
awesome-visual-grounding copied to clipboard

awesome visual grounding: a curated list of research papers in referring visual grounding

Awesome Visual Grounding

A curated list of research papers in grounding. Link to the code if available is also present.

Visual grounding task refers to localizing an object with a bounding-box or pixel-level mask given a query or a sentence. It is also called referring expression comprehension. Referring expression is basically uniquely identifying the object in question.

Datasets

Image Grounding Datasets

  1. Flickr30k: Plummer, Bryan A., et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of the IEEE international conference on computer vision. 2015. [Paper] [Code] [Website]

  2. RefClef: Kazemzadeh, Sahar, et al. Referitgame: Referring to objects in photographs of natural scenes. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014. [Paper] [Website]

  3. RefCOCOg: Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  4. RefCOCO and RefCOCO+: 1. Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]

  5. Visual Genome: Krishna, Ranjay, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123.1 (2017): 32-73. [Paper] [Website]

Instructions on RefClef, RefCOCO, RefCOCO+, RefCOCOg is nicely summarized here: https://github.com/lichengunc/refer

Video Datasets

  1. TaCoS: Regneri, Michaela, et al. Grounding action descriptions in videos. Transactions of the Association of Computational Linguistics 1 (2013): 25-36. [Paper] [Website]

  2. Charades: Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Website]

  3. Charades-STA: Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017).[Paper] [Code]

  4. Distinct Describable Moments (DiDeMo): Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code] [Website]

  5. ActivityNet Captions: Krishna, Ranjay, et al. Dense-captioning events in videos. Proceedings of the IEEE International Conference on Computer Vision. 2017. [Paper] [Website]

  6. Charades-Ego: [Website]

    • Sigurdsson, Gunnar, et al. Actor and Observer: Joint Modeling of First and Third-Person Videos. CVPR-IEEE Conference on Computer Vision & Pattern Recognition. 2018. [Paper] [Code]
    • Sigurdsson, Gunnar A., et al. "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos." arXiv preprint arXiv:1804.09626 (2018). [Paper] [Code]
  7. TEMPO: Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]

Paper Roadmap (Chronological Order):

Visual Grounding / Referring Expressions (Images):

  1. Karpathy, Andrej, Armand Joulin, and Li F. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems. 2014. [Paper]

  2. Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. Method name: Neural Talk. [Paper] [Code] [Torch Code] [Website]

  3. Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]

  4. Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  5. Wang, Liwei, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  6. Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]

  7. Nagaraja, Varun K., Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. European Conference on Computer Vision. Springer, Cham, 2016.[Paper] [Code]

  8. Rohrbach, Anna, et al. Grounding of textual phrases in images by reconstruction. European Conference on Computer Vision. Springer, Cham, 2016. Method Name: GroundR [Paper] [Tensorflow Code] [Torch Code]

  9. Wang, Mingzhe, et al. Structured matching for phrase localization. European Conference on Computer Vision. Springer, Cham, 2016. Method name: Structured Matching [Paper] [Code]

  10. Hu, Ronghang, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Website]

  11. Fukui, Akira et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP (2016). Method name: MCB [Paper][Code]

  12. Endo, Ko, et al. An attention-based regression model for grounding textual phrases in images. Proc. IJCAI. 2017. [Paper]

  13. Chen, Kan, et al. MSRC: Multimodal spatial regression with semantic context for phrase grounding. International Journal of Multimedia Information Retrieval 7.1 (2018): 17-28. [Paper -Springer Link]

  14. Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]

  15. Yu, Licheng, et al. A joint speakerlistener-reinforcer model for referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code][Website]

  16. Hu, Ronghang, et al. Modeling relationships in referential expressions with compositional modular networks. Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017. [Paper] [Code]

  17. Luo, Ruotian, and Gregory Shakhnarovich. Comprehension-guided referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code]

  18. Liu, Jingyu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. Proceedings of CVPR. 2017. [Paper]

  19. Xiao, Fanyi, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. arXiv preprint arXiv:1705.01371 (2017). [Paper]

  20. Plummer, Bryan A., et al. Phrase localization and visual relationship detection with comprehensive image-language cues. Proc. ICCV. 2017. [Paper] [Code]

  21. Chen, Kan, Rama Kovvuri, and Ram Nevatia. Query-guided regression network with context policy for phrase grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: QRC [Paper] [Code]

  22. Liu, Chenxi, et al. Recurrent Multimodal Interaction for Referring Image Segmentation. ICCV. 2017. [Paper] [Code]

  23. Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]

  24. Li, Xiangyang, and Shuqiang Jiang. Bundled Object Context for Referring Expressions. IEEE Transactions on Multimedia (2018). [Paper ieee link]

  25. Yu, Zhou, et al. Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. arXiv preprint arXiv:1805.03508 (2018). [Paper] [Code]

  26. Yu, Licheng, et al. Mattnet: Modular attention network for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. [Paper] [Code] [Website]

  27. Deng, Chaorui, et al. Visual Grounding via Accumulated Attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper]

  28. Zhang, Yundong, Juan Carlos Niebles, and Alvaro Soto. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. arXiv preprint arXiv:1808.00265 (2018). [Paper]

  29. Chen, Kan, Jiyang Gao, and Ram Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. arXiv preprint arXiv:1803.03879 (2018). [Paper] [Code]

  30. Zhang, Hanwang, Yulei Niu, and Shih-Fu Chang. Grounding referring expressions in images by variational context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code]

  31. Cirik, Volkan, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. arXiv preprint arXiv:1805.10547 (2018).[Paper] [Code]

  32. Margffoy-Tuay, Edgar, et al. Dynamic multimodal instance segmentation guided by natural language queries. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]

  33. Plummer, Bryan A., et al. Conditional image-text embedding networks. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]

  34. Akbari, Hassan, et al. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. arXiv preprint arXiv:1811.11683 (2018). [Paper]

  35. Kovvuri, Rama, and Ram Nevatia. PIRC Net: Using Proposal Indexing, Relationships and Context for Phrase Grounding. arXiv preprint arXiv:1812.03213 (2018). [Paper]

  36. Liu, Daqing, et al. Explainability by Parsing: Neural Module Tree Networks for Natural Language Visual Grounding. arXiv preprint arXiv:1812.03299 (2018). [Paper]

  37. Chen, Xinpeng, et al. Real-Time Referring Expression Comprehension by Single-Stage Grounding Network. arXiv preprint arXiv:1812.03426 (2018). [Paper]

  38. Wang, Peng, et al. Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks. arXiv preprint arXiv:1812.04794 (2018). [Paper]

  39. Deng, Chaorui, et al. You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding. arXiv preprint arXiv:1902.04213 (2019). [Paper]

  40. Ye, Linwei, et al. Cross-Modal Self-Attention Network for Referring Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]

Natural Language Object Retrieval (Images)

  1. Guadarrama, Sergio, et al. Open-vocabulary Object Retrieval. Robotics: science and systems. Vol. 2. No. 5. 2014. [Paper] [Code]

  2. Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]

  3. Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]

  4. Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]

  5. Nguyen, Anh, et al. Object Captioning and Retrieval with Natural Language. arXiv preprint arXiv:1803.06152 (2018). [Paper] [Website]

  6. Plummer, Bryan A., et al. Open-vocabulary Phrase Detection. arXiv preprint arXiv:1811.07212 (2018). [Paper] [Code]

Video Grounding (Activity Localization) using Natural Language:

  1. Yu, Haonan, et al. Grounded Language Learning from Video Described with Sentences Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2013. [Paper]

  2. Xu, Ran, et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. Proceedings of the AAAI Conference on Artificial Intelligence. 2015. [Paper]

  3. Song, Young Chol, et al. Unsupervised Alignment of Actions in Video with Text Descriptions Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2016. [Paper]

  4. Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017). Method name: TALL [Paper] [Code]

  5. Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code]

  6. Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. Video Object Segmentation with Language Referring Expressions. arXiv preprint arXiv:1803.08006 (2018). [Paper] [Website]

  7. Xu, Huijuan, et al. Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning. arXiv preprint arXiv:1804.05113 (2018). [Paper] [Code]

  8. Liu, Bingbin, et al. Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos. European Conference on Computer Vision. Springer, Cham, 2018. [Paper] [Website]

  9. Liu, Meng, et al. Attentive Moment Retrieval in Videos. Proceedings of the International ACM SIGIR Conference . 2018. [Paper] [Website]

  10. Chen, Jingyuan, et al. Temporally grounding natural sentence in video. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. [Paper]

  11. Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]

  12. Zhang, Da, et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. arXiv preprint arXiv:1812.00087 (2018). [Paper]

  13. Ge, Runzhou, et al. MAC: Mining Actiivity Concepts for Language-based Temporal Localization. arXiv preprint arXiv:1811.08925 (2018). [Paper] [Code]

  14. Xu, Huijuan, et al. Joint Event Detection and Description in Continuous Video Streams. arXiv preprint arXiv:1802.10250 (2018). [Paper] [Code]

  15. He, Dongliang, et al. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Proceedings of the AAAI Conference on Artificial Intelligence. 2019. [Paper]