
How can I reduce the dimension of the image feature from encode_image?

Open sdalinluo opened this issue 2 years ago • 6 comments

```python
image = preprocess(Image.open(image_name)).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    write_image_tensor(item_id, image_features)
```

The type of image_features is <class 'torch.Tensor'> with shape torch.Size([1, 512]). How can I reduce the dimension of the image feature produced by encode_image, e.g. to torch.Size([1, 128]) or torch.Size([1, 64])?

Many thanks.

sdalinluo avatar Feb 23 '23 10:02 sdalinluo

Facing a similar issue. I need to increase the dimension to [1, 1024]. Did you find any way @sdalinluo?

nityanandmathur avatar Mar 24 '23 19:03 nityanandmathur

It is not easy to increase or decrease the dimension of the image embedding without fine-tuning CLIP again. So what we can do is attach a projection layer on top of the CLIP text and vision encoders, and then fine-tune those models.

I solved a similar problem using Hugging Face's implementation of CLIP. Below is the code you can try out:

```python
from transformers import (
    CLIPTextConfig,
    CLIPVisionConfig,
    CLIPTextModelWithProjection,
    CLIPVisionModelWithProjection,
)

CLIP_CHECKPOINTS = "openai/clip-vit-base-patch32"
PROJECTION_DIM = 512     # Replace with your desired embedding dimension
padding_max_length = 77  # CLIP's default maximum sequence length

textConfig = CLIPTextConfig.from_pretrained(CLIP_CHECKPOINTS)
textConfig.projection_dim = PROJECTION_DIM
textConfig.max_position_embeddings = padding_max_length

visionConfig = CLIPVisionConfig.from_pretrained(CLIP_CHECKPOINTS)
visionConfig.projection_dim = PROJECTION_DIM

# CLIP text model with a projection head on top
clipTextModel = CLIPTextModelWithProjection.from_pretrained(
    pretrained_model_name_or_path=CLIP_CHECKPOINTS,
    config=textConfig,
    ignore_mismatched_sizes=True,
)

# CLIP vision (ViT) model with a projection head on top
clipVisionModel = CLIPVisionModelWithProjection.from_pretrained(
    pretrained_model_name_or_path=CLIP_CHECKPOINTS,
    config=visionConfig,
    ignore_mismatched_sizes=True,
)

# Now fine-tune CLIP so that the new projection embeddings become meaningful.
```
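For reference, a minimal usage sketch (not part of the original comment; it assumes the `CLIP_CHECKPOINTS`, `PROJECTION_DIM`, `padding_max_length`, `clipTextModel`, and `clipVisionModel` defined above, plus a hypothetical `example.jpg`) to check the new embedding size:

```python
import torch
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained(CLIP_CHECKPOINTS)

image = Image.open("example.jpg")  # hypothetical image path
inputs = processor(
    text=["a photo of a cat"],
    images=image,
    return_tensors="pt",
    padding="max_length",
    max_length=padding_max_length,
)

with torch.no_grad():
    text_out = clipTextModel(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )
    vision_out = clipVisionModel(pixel_values=inputs["pixel_values"])

print(text_out.text_embeds.shape)     # torch.Size([1, PROJECTION_DIM])
print(vision_out.image_embeds.shape)  # torch.Size([1, PROJECTION_DIM])
```

Until the models are fine-tuned, the re-initialized projection weights are random, so these embeddings are not yet meaningful.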

hrithickcodes avatar Jul 08 '23 15:07 hrithickcodes

That's great! It's really helpful to set up the config like this when using CLIP.

Young-Chin avatar Jul 31 '23 09:07 Young-Chin

Hi, can you help with how the models should be trained after applying the above code (as mentioned in the comment at the end of the snippet)?

khandelwalronak0809 avatar Jan 28 '24 00:01 khandelwalronak0809

> Hi, can you help with how the models should be trained after applying the above code (as mentioned in the comment at the end of the snippet)?

In fact, I ended up not using CLIP that way. But you could try setting the training mode with xxxmodel.train(), and treat it as a pretrained network to fine-tune in your own way. Hope that helps.
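As a rough illustration only (it assumes the `clipTextModel` and `clipVisionModel` from the snippet above and a hypothetical `dataloader` yielding paired image/text batches), a CLIP-style contrastive fine-tuning loop might look like this:

```python
import torch
import torch.nn.functional as F

clipTextModel.train()
clipVisionModel.train()

params = list(clipTextModel.parameters()) + list(clipVisionModel.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)

for batch in dataloader:  # hypothetical DataLoader of matched image/caption pairs
    image_embeds = clipVisionModel(pixel_values=batch["pixel_values"]).image_embeds
    text_embeds = clipTextModel(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    ).text_embeds

    # Normalize and compute the symmetric contrastive (InfoNCE) loss CLIP uses.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / 0.07  # fixed temperature for simplicity
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

If you only want the new projection heads to adapt, you could also freeze the backbone parameters and optimize only the projection layers.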

Young-Chin avatar Jan 28 '24 06:01 Young-Chin

Thanks a lot. Will try that on my dataset.

khandelwalronak0809 avatar Jan 29 '24 10:01 khandelwalronak0809