cross-modal-retrieval topic
clip-as-service
🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
xmodaler
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense r...
pvse
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019)
Awesome_Matching_Pretraining_Transfering
A paper list covering large multi-modality models, parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, for preliminary insight.
SGRAF
[AAAI2021] The code of “Similarity Reasoning and Filtration for Image-Text Matching”
vse_infty
Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021
muscall
Official implementation of "Contrastive Audio-Language Learning for Music" (ISMIR 2022)
VLDeformer
PyTorch implementation of the paper "VLDeformer: Vision Language Decomposed Transformer for Fast Cross-modal Retrieval", KBS 2022
TextReID
[BMVC 2021] Text-Based Person Search with Limited Data
objects-that-sound
An unofficial implementation of the paper "Objects that Sound" (ECCV 2018).