How to do cross-modal retrieval?
I'm curious about how to do cross-modal retrieval with the YouTube-8M dataset. I have videos with image and audio data, and would like to learn two encoders that embed both audio and RGB data into the same space, such that nearest neighbor lookups could be performed with audio embeddings to find related images, and vice versa.
Is there an easy way to extend the loss functions required by SimilarityModel to support two input heads?
Dataset signature:

```python
(features, labels) = ({'rgb': ..., 'audio': ...}, {'video_id': ...})
```
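For reference, here is a minimal sketch of the kind of setup I have in mind: two small Keras towers trained with a CLIP-style symmetric cross-entropy loss, rather than the existing `SimilarityModel` losses. The 1024-d RGB and 128-d audio input sizes are the YouTube-8M video-level feature dimensions; the layer sizes, embedding dimension, and temperature are placeholder assumptions.

```python
import tensorflow as tf

EMBED_DIM = 256  # shared embedding dimension (arbitrary choice)

def make_tower(name):
    # Small MLP projection head; any per-modality architecture works here.
    return tf.keras.Sequential(
        [
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(EMBED_DIM),
        ],
        name=name,
    )

rgb_encoder = make_tower("rgb_encoder")
audio_encoder = make_tower("audio_encoder")
rgb_encoder.build((None, 1024))    # YouTube-8M video-level rgb features
audio_encoder.build((None, 128))   # YouTube-8M video-level audio features

def clip_loss(rgb_emb, audio_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities.
    rgb_emb = tf.math.l2_normalize(rgb_emb, axis=-1)
    audio_emb = tf.math.l2_normalize(audio_emb, axis=-1)
    # (batch, batch) similarity matrix; matching pairs sit on the diagonal.
    logits = tf.matmul(rgb_emb, audio_emb, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])
    loss_rgb = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    loss_audio = tf.keras.losses.sparse_categorical_crossentropy(
        labels, tf.transpose(logits), from_logits=True)
    return tf.reduce_mean(loss_rgb + loss_audio) / 2.0

optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(features):
    # features matches the dataset signature above: {'rgb': ..., 'audio': ...}
    with tf.GradientTape() as tape:
        rgb_emb = rgb_encoder(features["rgb"], training=True)
        audio_emb = audio_encoder(features["audio"], training=True)
        loss = clip_loss(rgb_emb, audio_emb)
    variables = rgb_encoder.trainable_variables + audio_encoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```

In use, a `tf.data` pipeline yielding the `(features, labels)` tuple above would just call `train_step(features)` per batch; the `video_id` labels are only needed later to map retrieved indices back to videos.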
This would be similar to the CLIP model. We are looking to add an example notebook for this at some point.
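For the retrieval side, a rough self-contained sketch of the cross-modal nearest-neighbor lookup (brute-force cosine similarity in NumPy; the random arrays are placeholders standing in for embeddings produced by the two trained towers):

```python
import numpy as np

EMBED_DIM = 256
rng = np.random.default_rng(0)

# Placeholder index: one L2-normalized audio embedding per video. In practice
# these would come from the trained audio encoder, not from random data.
audio_index = rng.normal(size=(1000, EMBED_DIM)).astype("float32")
audio_index /= np.linalg.norm(audio_index, axis=-1, keepdims=True)

def nearest_audio(rgb_embedding, k=5):
    # rgb_embedding: (EMBED_DIM,) query vector from the RGB tower.
    q = rgb_embedding / np.linalg.norm(rgb_embedding)
    scores = audio_index @ q        # cosine similarities against the index
    return np.argsort(-scores)[:k]  # indices of the top-k matching videos

query = rng.normal(size=(EMBED_DIM,)).astype("float32")
print(nearest_audio(query))
```

An approximate nearest-neighbor library could replace the brute-force matrix product once the index grows large.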
Hi, I was wondering if you did `import CLIP`?