
Train Model With Multiple Input Images

Open Neltherion opened this issue 3 years ago • 3 comments

Is it possible to change the model to accept more than one image as the input?

If I'm not mistaken, CLIP takes an image and a text as inputs, extracts features from each, and finally gives us logits representing the similarity between the image and the text.

So, is it possible to give two (or more) input images and extract ONE feature from the input images (just like before)?

I want to somehow mix the two inputs. For example, inputting an image alongside its semantic segmentation as the input to the model. If it's possible, what parts of the code should I change? Or is this already implemented and usable?

Thanks.

Neltherion avatar Feb 18 '23 10:02 Neltherion

There are multiple ways to do this:

  1. Late fusion: use the existing model on each of your images, then aggregate the embeddings (with a simple average or something stronger)
  2. Early fusion: for this you would indeed need to adapt the open_clip code, produce a large (N images, text) dataset, and then retrain

I advise starting with 1.
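For option 1, the fusion can stay entirely outside the model: encode each image separately (e.g. with open_clip's `model.encode_image`) and average the L2-normalized embeddings. A minimal sketch, with random tensors standing in for the real encoder output (the 512-dim width is an assumption matching ViT-B/32; the helper name `late_fuse` is hypothetical):

```python
import torch

def late_fuse(embeddings: torch.Tensor) -> torch.Tensor:
    """Average N per-image embeddings into one fused embedding.

    embeddings: (N, D) tensor, e.g. the output of model.encode_image
    on a batch of related images (photo, segmentation map, ...).
    """
    # Normalize each row so every image contributes equally.
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    fused = embeddings.mean(dim=0)
    # Re-normalize so the fused vector lives on the unit sphere,
    # as CLIP similarity scores expect.
    return fused / fused.norm()

# Stand-in for encode_image output on two images (photo + segmentation).
feats = torch.randn(2, 512)
fused = late_fuse(feats)
```

The fused vector can then be compared against text embeddings exactly as a single-image embedding would be.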


rom1504 avatar Feb 18 '23 12:02 rom1504

Thanks, I was really looking for approach #2 😁 Is it possible, and if so, any hints on where to start?

Neltherion avatar Feb 18 '23 21:02 Neltherion

Maybe you could have a look at https://github.com/LAION-AI/temporal-embedding-aggregation first
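For early fusion specifically, one common starting point (an assumption about how you might adapt the code, not something open_clip ships) is to widen the vision tower's patch-embedding convolution (`visual.conv1` in open_clip's VisionTransformer) so it accepts the extra segmentation channels, copying the pretrained RGB weights and zero-initializing the new ones so the adapted model initially behaves like the original. A sketch of that surgery on a standalone conv layer:

```python
import torch
import torch.nn as nn

def widen_patch_embedding(conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Return a conv accepting extra input channels (e.g. a segmentation map).

    Pretrained weights for the original channels are copied; weights for the
    new channels start at zero, so at initialization the output is identical
    to the original conv applied to the original channels.
    """
    new_conv = nn.Conv2d(conv.in_channels + extra_channels, conv.out_channels,
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight[:, :conv.in_channels] = conv.weight
        new_conv.weight[:, conv.in_channels:] = 0.0
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# ViT-B/32-like patch embedding: 3 RGB channels -> 768-dim patches.
old = nn.Conv2d(3, 768, kernel_size=32, stride=32, bias=False)
new = widen_patch_embedding(old, extra_channels=3)

# 6-channel input: RGB image concatenated with a 3-channel segmentation map.
x = torch.randn(1, 6, 224, 224)
out = new(x)
```

After swapping the conv in, you would still need the retraining (or fine-tuning) step from option 2, since the new channels carry no learned signal yet.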


rom1504 avatar Feb 18 '23 22:02 rom1504