How to combine the image and text embeddings output by CLIP into one fixed-length vector?

karndeepsingh opened this issue 2 years ago • 5 comments

Hi, I have product images and their descriptions, and I want to generate product embeddings from the image and description of each product. Can CLIP help to generate embeddings for the products and combine them into one fixed-length vector that represents the product embedding? If so, once I train the CLIP model on image and text, how can I evaluate the efficiency and accuracy of the learned embeddings?

Please advise. My objective is to generate a single product embedding from each product's image and description. Looking forward to your suggestions.

karndeepsingh avatar Jul 19 '22 10:07 karndeepsingh

Can CLIP help to generate embeddings for the products and combine them into one fixed-length vector that represents the product embedding?

The answer is yes. CLIP can produce a text embedding and an image embedding from text and image input respectively. As a consequence, for each product in your case, one embedding will be generated per modality. Then you can simply concatenate or add the two embeddings to obtain one fixed-length vector. By the way, you can also use each modality's embedding separately for downstream applications, and aggregate the predictions/results with different weights at the end.
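For instance, here is a minimal sketch using the clip-client API together with NumPy; the server address and the example image URI are placeholders for your own deployment and data:

```python
import numpy as np
from clip_client import Client  # pip install clip-client

# assumes a clip-as-service server is reachable at this placeholder address
c = Client('grpc://0.0.0.0:51000')

# plain strings are encoded as text; URIs / file paths are encoded as images
text_emb = c.encode(['A red cotton t-shirt with a round neck'])[0]
image_emb = c.encode(['https://example.com/tshirt.jpg'])[0]  # placeholder URI

# option 1: concatenation -> a 2*d-dimensional fixed-length product vector
product_emb = np.concatenate([text_emb, image_emb])

# option 2: (weighted) addition -> a d-dimensional vector; L2-normalize first
# so that neither modality dominates
text_emb /= np.linalg.norm(text_emb)
image_emb /= np.linalg.norm(image_emb)
product_emb = 0.5 * text_emb + 0.5 * image_emb
```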

once I train the CLIP model on image and text, how can I evaluate the efficiency and accuracy of the learned embeddings?

To train and evaluate CLIP models with custom data, I suggest you try the Finetuner tool. It enables users to fine-tune and evaluate their embedding models with less effort. For more details, you can check this doc: https://finetuner.jina.ai/tasks/text-to-image/
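Roughly, a fine-tuning run looks like the sketch below (adapted from the linked doc; the dataset names are illustrative, and the exact arguments may differ between Finetuner versions):

```python
import finetuner

finetuner.login()  # Finetuner runs in Jina AI Cloud

run = finetuner.fit(
    model='openai/clip-vit-base-patch32',  # CLIP backbone to fine-tune
    train_data='my-product-train-data',    # illustrative dataset name
    eval_data='my-product-eval-data',      # illustrative dataset name
    loss='CLIPLoss',                       # contrastive text-image loss
    epochs=5,
)
```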

numb3r3 avatar Jul 20 '22 04:07 numb3r3

@numb3r3 Thanks for answering. I just wanted to understand how I can evaluate my already-trained CLIP model. Also, is Finetuner chargeable, or is it free to use?

karndeepsingh avatar Jul 20 '22 06:07 karndeepsingh

how I can evaluate my already-trained CLIP model?

You can evaluate your model on a downstream task, e.g., classification or retrieval, based on your dataset.
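For example, for retrieval you can embed the descriptions as queries and the images as the index, then measure how often the matching product is retrieved. A minimal NumPy sketch of recall@k, assuming text_embs and image_embs are row-aligned arrays where row i belongs to the same product:

```python
import numpy as np

def recall_at_k(text_embs: np.ndarray, image_embs: np.ndarray, k: int = 5) -> float:
    """Fraction of text queries whose matching image appears in the top-k results."""
    # L2-normalize so the dot product equals cosine similarity
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = t @ v.T                           # (n_queries, n_images)
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar images
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return float(hits.mean())
```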

And Finetuner is free to use. Please play around with it, and share your comments!

numb3r3 avatar Jul 20 '22 09:07 numb3r3

I have a few more questions:

  1. Do I need to train my CLIP model again with Finetuner?
  2. How do I need to prepare the data? Right now, I have a dataframe with one column "image path" and another column "caption" for those particular images. If you can share some examples or notebooks to plug my work directly into Finetuner, that would be a great help.
  3. I can see that my data needs to be uploaded to the cloud and training has to be done in the cloud. But I don't have permission to share the data with any outside server. Is it possible to use Finetuner on my machine directly, instead of uploading the data to the cloud and doing the training and evaluation there?

Thanks

karndeepsingh avatar Jul 20 '22 09:07 karndeepsingh

Do I need to train my CLIP model again with Finetuner?

That depends on your domain. The pre-trained CLIP model was trained on a large-scale corpus in the general domain, so you can usually achieve decent results without fine-tuning. Of course, fine-tuning almost always leads to a better model.

How do I need to prepare the data? Right now, I have a dataframe with one column "image path" and another column "caption" for those particular images. If you can share some examples or notebooks to plug my work directly into Finetuner, that would be a great help.

Yes, you need to convert your data into DocArray format. Check out tab 3 on this page: https://finetuner.jina.ai/walkthrough/create-training-data/
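As a sketch, converting your dataframe with "image path" and "caption" columns into that format could look like this (the file and dataset names are illustrative, assuming the DocArray version used by Finetuner at the time):

```python
import pandas as pd
from docarray import Document, DocumentArray

df = pd.read_csv('products.csv')  # illustrative; columns: "image path", "caption"

# one Document per product, with an image chunk and a text chunk
train_data = DocumentArray(
    Document(
        chunks=[
            Document(uri=row['image path'], modality='image'),
            Document(text=row['caption'], modality='text'),
        ]
    )
    for _, row in df.iterrows()
)

train_data.push('my-product-train-data')  # upload so Finetuner can access it
```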

I can see that my data needs to be uploaded to the cloud and training has to be done in the cloud. But I don't have permission to share the data with any outside server. Is it possible to use Finetuner on my machine directly, instead of uploading the data to the cloud and doing the training and evaluation there?

As for now, yes, the only way is to upload it to the cloud. However, fine-tuning does not require "all" of your data; constructing several thousand training examples would be enough.

numb3r3 avatar Jul 22 '22 01:07 numb3r3

We will close this issue for now. If you have new findings to share, you are welcome to create a new ticket. Thanks!

numb3r3 avatar Aug 31 '22 04:08 numb3r3