How to evaluate the CLIP model results?
Hi, I have trained a CLIP model using images and their captions. Now I want to evaluate the performance of the model with metrics like precision, recall, and F1 score. How can I do that?
Any suggestions would be appreciated.
Thanks
What is stopping you from doing so? You can simply calculate the metrics using scikit-learn; just put the calculation inside the training loop. But I wonder what you mean by precision/recall/F1 in this context. CLIP is not a classification model; it is an embedding model with an objective function that mimics a classification problem, even though it is not classification. So it's better to state what you are trying to achieve with classification metrics here.
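For example, if you treat in-batch image–text matching as a pseudo-classification problem, a rough sketch with scikit-learn could look like this (assuming an OpenAI-style CLIP model whose forward pass returns image–text logits; adapt it to your own model's API):

```python
import torch
from sklearn.metrics import precision_recall_fscore_support

@torch.no_grad()
def batch_matching_metrics(model, images, texts):
    # For a batch of paired (image, caption) examples, the "correct class"
    # of image i is caption i, so the diagonal is the ground truth.
    logits_per_image, _ = model(images, texts)        # shape: (B, B)
    y_true = torch.arange(logits_per_image.size(0))   # index of the matching caption
    y_pred = logits_per_image.argmax(dim=-1).cpu()    # most similar caption per image
    p, r, f1, _ = precision_recall_fscore_support(
        y_true.numpy(), y_pred.numpy(), average="micro", zero_division=0
    )
    return p, r, f1
```

With micro averaging these numbers all collapse to in-batch matching accuracy; macro averaging would treat each index as its own class. Either way it is really a retrieval score dressed up as a classification metric, which is why I'm asking what you actually want to measure.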
That's true. Let me explain what I am trying to achieve. I have product images and their descriptions, and I want to generate a product embedding from the image and description of each product. Can CLIP help generate embeddings for products? If so, once I have trained the CLIP model on the image–text pairs, how can I evaluate the efficiency and accuracy of the learned embeddings?
Please suggest. My objective is to generate a single product embedding from each product's image and description. Waiting for your suggestions.
Thanks
Using an F1 score requires two important values: `y_true` and `y_pred`. Although you can't exactly get those values using CLIP, you can still check the performance with F1 scores. If you've already trained the model, then evaluation is pretty straightforward:
- create a validation set of images and their real descriptions
- convert both of them via `model.encode_image(image: Tensor)` and `model.encode_text(text: Tensor)`
Now there are multiple ways to compare the resulting encodings; one of them is the cosine similarity score used in `model(image: Tensor, text: Tensor)`.
Theoretically, you can also use the F1 metric (or any other metric, for that matter), because I believe the dimensions of both embeddings are the same. But do keep in mind that there's really no `y_true` in this case, just two different predicted encodings that need to be matched.
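To make this concrete, here is a minimal sketch of such an evaluation on a validation set, assuming an OpenAI-style CLIP model (the `encode_image`/`encode_text` methods above) and already preprocessed/tokenized inputs; the function and variable names are just for illustration:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_retrieval(model, val_images, val_texts, device="cpu"):
    """Encode a validation set and score image->text retrieval.

    val_images: preprocessed image tensor of shape (N, 3, H, W)
    val_texts:  tokenized text tensor of shape (N, context_length)
    Returns top-1 retrieval accuracy and the mean cosine similarity
    of the matching (image, caption) pairs.
    """
    model.eval()
    img_emb = F.normalize(model.encode_image(val_images.to(device)), dim=-1)
    txt_emb = F.normalize(model.encode_text(val_texts.to(device)), dim=-1)

    sims = img_emb @ txt_emb.T                       # (N, N) cosine similarities
    targets = torch.arange(sims.size(0), device=sims.device)

    top1 = (sims.argmax(dim=-1) == targets).float().mean().item()
    matched_sim = sims.diag().mean().item()          # similarity of the true pairs
    return top1, matched_sim
```

Here the retrieved caption index plays the role of `y_pred` and the true index plays the role of `y_true`, so top-1 (or top-k) retrieval accuracy is the closest analogue of a classification metric for an embedding model like this.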
Let me know if this works for you.
Hi, thanks for the answers @PsVenom. I would like to ask a few more questions:
- I have images of the product and its attributes like gender, color, type, size, etc. How can I prepare text from these attributes so that I can pair it with the image, and later, at inference time, recover all the attributes of a new image? Please give me some idea of how to prepare the attribute text so that I can train the CLIP model with better context.
- While training the official CLIP model, did the authors directly use the class as the image description, or did they build a contextual sentence from the class? For example, given an image of a dog with class "DOG", did they pair the image with the raw class name "DOG" as the text, or with a contextual sentence made from the class, like "A DOG playing in grass"?
Please help me understand the text-preparation step: I have the classes of a particular product as its attributes (color, type, size, gender, etc.), and I want to know how to prepare a better description from these attributes so that, at inference time, I can extract these attributes from a new image.
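For example, would something like the following attribute-to-caption templating be a reasonable starting point? (The attribute keys and the template wording here are just made up for illustration.)

```python
# Sketch of turning product attributes into a natural-language caption
# that could be paired with the product image for CLIP training.
# The attribute keys and template wording are hypothetical.
def attributes_to_caption(product: dict) -> str:
    base = " ".join(
        str(product[key]) for key in ("color", "type") if product.get(key)
    )
    caption = f"a photo of a {base}" if base else "a photo of a product"
    if product.get("size"):
        caption += f", size {product['size']}"
    if product.get("gender"):
        caption += f", for {product['gender']}"
    return caption

print(attributes_to_caption(
    {"color": "red", "type": "t-shirt", "size": "medium", "gender": "men"}
))
# -> "a photo of a red t-shirt, size medium, for men"
```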
Thanks.