vision-and-language topic
lnfmm
Latent Normalizing Flows for Many-to-Many Cross Domain Mappings (ICLR 2020)
SpaCap3D
[IJCAI 2022] Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds (official pytorch implementation)
Awesome-Colorful-LLM
Recent advances propelled by large language models (LLMs), spanning Vision, Audio, Agent, Robotics, and fundamental sciences such as Mathematics.
STL-VQA
Good practices for VQA systems such as POS-tag attention, structured triplet learning, and triplet attention are very general and can be inserted into almost any vision-and-language task.
CPL
Official implementation of our EMNLP 2022 paper "CPL: Counterfactual Prompt Learning for Vision and Language Models"
clip-openness
[ACL 2023] Delving into the Openness of CLIP
Vote2Cap-DETR
[T-PAMI 2024 & CVPR 2023] Vote2Cap-DETR: a set-to-set perspective on 3D dense captioning; state-of-the-art 3D dense captioning methods.
Aerial-Vision-and-Dialog-Navigation
Codebase of ACL 2023 Findings "Aerial Vision-and-Dialog Navigation"
TGN
TensorFlow reproduction of the EMNLP 2018 paper "Temporally Grounding Natural Sentence in Video"
awesome-vqa-latest
Up-to-date Visual Question Answering paper list.