How to do cross-modal retrieval?
I'm curious about how to do cross-modal retrieval with the YouTube-8M dataset. I have videos with image and audio data, and would like to learn two encoders that embed both audio and RGB data into the same space, such that nearest neighbor lookups could be performed with audio embeddings to find related images, and vice versa.
Is there an easy way to extend the loss functions required by SimilarityModel to support two input heads?
Dataset signature:

```python
(features, labels) = ({'rgb': ..., 'audio': ...}, {'video_id': ...})
```
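For reference, here is a minimal sketch of the kind of setup I have in mind: two small Keras towers trained with a CLIP-style symmetric cross-entropy loss, rather than the existing `SimilarityModel` losses. The 1024-d RGB and 128-d audio input sizes are the YouTube-8M video-level feature dimensions; the layer sizes, embedding dimension, and temperature are placeholder assumptions.

```python
import tensorflow as tf

EMBED_DIM = 256  # shared embedding dimension (arbitrary choice)

def make_tower(name):
    # Small MLP projection head; any per-modality architecture works here.
    return tf.keras.Sequential(
        [
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(EMBED_DIM),
        ],
        name=name,
    )

rgb_encoder = make_tower("rgb_encoder")
audio_encoder = make_tower("audio_encoder")
rgb_encoder.build((None, 1024))    # YouTube-8M video-level rgb features
audio_encoder.build((None, 128))   # YouTube-8M video-level audio features

def clip_loss(rgb_emb, audio_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities.
    rgb_emb = tf.math.l2_normalize(rgb_emb, axis=-1)
    audio_emb = tf.math.l2_normalize(audio_emb, axis=-1)
    # (batch, batch) similarity matrix; matching pairs sit on the diagonal.
    logits = tf.matmul(rgb_emb, audio_emb, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])
    loss_rgb = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    loss_audio = tf.keras.losses.sparse_categorical_crossentropy(
        labels, tf.transpose(logits), from_logits=True)
    return tf.reduce_mean(loss_rgb + loss_audio) / 2.0

optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(features):
    # features matches the dataset signature above: {'rgb': ..., 'audio': ...}
    with tf.GradientTape() as tape:
        rgb_emb = rgb_encoder(features["rgb"], training=True)
        audio_emb = audio_encoder(features["audio"], training=True)
        loss = clip_loss(rgb_emb, audio_emb)
    variables = rgb_encoder.trainable_variables + audio_encoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```

In use, a `tf.data` pipeline yielding the `(features, labels)` tuple above would just call `train_step(features)` per batch; the `video_id` labels are only needed later to map retrieved indices back to videos.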
This would be similar to the CLIP model. We are looking to add an example notebook for this at some point.
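For the retrieval side, a rough self-contained sketch of the cross-modal nearest-neighbor lookup (brute-force cosine similarity in NumPy; the random arrays are placeholders standing in for embeddings produced by the two trained towers):

```python
import numpy as np

EMBED_DIM = 256
rng = np.random.default_rng(0)

# Placeholder index: one L2-normalized audio embedding per video. In practice
# these would come from the trained audio encoder, not from random data.
audio_index = rng.normal(size=(1000, EMBED_DIM)).astype("float32")
audio_index /= np.linalg.norm(audio_index, axis=-1, keepdims=True)

def nearest_audio(rgb_embedding, k=5):
    # rgb_embedding: (EMBED_DIM,) query vector from the RGB tower.
    q = rgb_embedding / np.linalg.norm(rgb_embedding)
    scores = audio_index @ q        # cosine similarities against the index
    return np.argsort(-scores)[:k]  # indices of the top-k matching videos

query = rng.normal(size=(EMBED_DIM,)).astype("float32")
print(nearest_audio(query))
```

An approximate nearest-neighbor library could replace the brute-force matrix product once the index grows large.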
Hi, I was wondering if you did `import CLIP`?